Abstract: We propose a dynamic spectrum access scheme
where secondary users cooperatively recommend "good" channels
to each other and access them accordingly. We formulate the
problem as an average reward based Markov decision process.
We show the existence of the optimal stationary spectrum access
policy, and explore its structural properties in two asymptotic
cases. Since the action space of the Markov decision process is
continuous, it is difficult to find the optimal policy by simply
discretizing the action space and using the policy iteration, value
iteration, or Q-learning methods. Instead, we propose a new
algorithm based on the Model Reference Adaptive Search method,
and prove its convergence to the optimal policy. Numerical
results show that the proposed algorithm achieves up to 18%
and 100% performance improvement over the static channel
recommendation scheme in homogeneous and heterogeneous
channel environments, respectively, and is more robust to channel
dynamics.



I. Introduction 

Cognitive radio technology enables unlicensed secondary
wireless users to opportunistically share the spectrum with
licensed primary users, and thus offers a promising solution to
address the spectrum under-utilization problem [1]. Designing
an efficient spectrum access mechanism for cognitive radio
networks, however, is challenging for several reasons: (1)
time-variation: spectrum opportunities available for secondary
users are often time-varying due to primary users' stochastic
activities [1]; and (2) limited observations: each secondary
user often has a limited view of the spectrum opportunities
due to the limited spectrum sensing capability [2]. Several
characteristics of the wireless channels, on the other hand,
turn out to be useful for designing efficient spectrum access
mechanisms: (1) temporal correlation: spectrum availabilities
are correlated in time, and thus observations in the past can
be useful in the near future [3]; and (2) spatial correlation:
secondary users close to one another may experience similar
spectrum availabilities [4]. In this paper, we shall explore the
time and space correlations and propose a recommendation-based
collaborative spectrum access algorithm, which achieves
good communication performance for the secondary users.

Our algorithm design is directly inspired by the recommendation
system in the electronic commerce industry. For
example, existing owners of various products can provide
recommendations (reviews) on Amazon.com, so that other
potential customers can pick the products that best suit their
needs. Motivated by this, Li in [5] proposed a static channel
recommendation scheme, where secondary users recommend
the channels they have successfully accessed to nearby secondary
users. Since each secondary user originally only has
a limited view of spectrum availability, such information
exchange enables secondary users to take advantage of the
correlations in time and space, make more informed decisions,
and achieve a high total transmission rate.

Fig. 1. Illustration of the channel recommendation scheme. User D
recommends channel 4 to other users. As a result, both user A and user
C access the same channel 4, which leads to congestion and a reduced rate
for both users.

The recommendation scheme in [5], however, is rather static
and does not dynamically change with network conditions. In
particular, the static scheme ignores two important characteristics
of cognitive radios. The first one is the time variability we
mentioned before. The second one is the congestion effect. As
depicted in Figure 1, too many users accessing the same good
channel leads to congestion and a reduced rate for everyone.

To address the shortcomings of the static recommendation 
scheme, in this paper we propose an adaptive channel rec- 
ommendation scheme, which adaptively changes the spectrum 
access probabilities based on users' latest channel recommen- 
dations. We formulate and analyze the system as a Markov 
decision process (MDP), and propose a numerical algorithm 
that always converges to the optimal spectrum access policy. 

The main results and contributions of this paper include: 

• Markov decision process formulation: we formulate and
analyze the optimal recommendation-based spectrum access
as an average reward MDP.

• Existence and structure of the optimal policy: we show
that there always exists a stationary optimal spectrum
access policy, which requires only the channel recommendation
information of the most recent time slot. We
also explicitly characterize the structure of the optimal
stationary policy in two asymptotic cases (either the
number of channels or the number of users goes to infinity).

• Novel algorithm for finding the optimal policy: we propose
an algorithm based on the recently developed Model
Reference Adaptive Search method [6] to find the optimal
stationary spectrum access policy. The algorithm has a
low complexity even when dealing with a continuous
action space of the MDP. We also show that it always
converges to the optimal stationary policy.

• Superior performance: we show that the proposed algorithm
achieves up to 18% performance improvement
over the static channel recommendation scheme and 10%
performance improvement over the Q-learning method,
and is also robust to channel dynamics.

The rest of the paper is organized as follows. We introduce
the system model and the static channel recommendation
scheme in Sections II and III-A, respectively. We then discuss
the motivation for designing an adaptive channel recommendation
scheme in Section III-B. The Markov decision process
formulation and the structure results of the optimal policy are
presented in Section IV, followed by the Model Reference
Adaptive Search based algorithm in Section V. We illustrate
the performance of the algorithm through numerical results in
Section VII. We discuss the related work in Section VIII and
conclude in Section IX.

II. System Model 

We consider a cognitive radio network with M parallel
and stochastically heterogeneous primary channels. N
homogeneous secondary users try to access these channels
using a slotted transmission structure (see Figure 2). The
secondary users can exchange information by broadcasting
messages over a common control channel^1. We assume that
the secondary users are located close-by, thus they experience
similar spectrum availabilities and can hear one another's
broadcasting messages. To protect the primary transmissions,
secondary users need to sense the channel states before their
data transmission.

The system model is described as follows: 

• Channel state: For each primary channel m, the channel
state at time slot t is

$$S_m(t) = \begin{cases} 0, & \text{if channel } m \text{ is occupied by primary transmissions}, \\ 1, & \text{if channel } m \text{ is idle}. \end{cases}$$

• Channel state transition: The states of different channels
change according to independent Markovian processes
(see Figure 3). We denote the channel state probability
vector of channel m at time t as
$p_m(t) = (\Pr\{S_m(t) = 0\}, \Pr\{S_m(t) = 1\})$, which follows a
two-state Markov chain as $p_m(t) = p_m(t-1)\Gamma_m, \forall t \ge 1$,
with the transition matrix

$$\Gamma_m = \begin{bmatrix} 1 - p_m & p_m \\ q_m & 1 - q_m \end{bmatrix}.$$

Note that when $p_m = 0$ or $q_m = 0$, the channel state
stays unchanged. In the rest of the paper, we will look
at the more interesting and challenging cases where
$0 < p_m < 1$ and $0 < q_m < 1$. The stationary distribution of
the Markov chain is given as

$$\lim_{t \to \infty} \Pr\{S_m(t) = 0\} = \frac{q_m}{p_m + q_m}, \tag{1}$$

$$\lim_{t \to \infty} \Pr\{S_m(t) = 1\} = \frac{p_m}{p_m + q_m}. \tag{2}$$

^1 Please refer to [3] for the details on how to set up and maintain a reliable
common control channel in cognitive radio networks.

• Heterogeneous channel throughput: When a secondary
user transmits successfully on an idle channel m, it
achieves a data rate of $B_m$. Different channels can
support different data rates.

• Channel contention: To resolve the transmission collision
when multiple secondary users access the same channel, a
backoff mechanism is used (see Figure 2 for illustration).
The contention stage of a time slot is divided into $\lambda_{\max}$
mini-slots, and each user n executes the following two
steps:

1) Count down according to a randomly and uniformly
chosen integer backoff time (number of mini-slots)
$\lambda_n$ between 1 and $\lambda_{\max}$.

2) Once the timer expires, monitor the channel and
transmit RTS/CTS messages to grab the channel if
the channel is clear (i.e., no ongoing transmission).
Note that if multiple users choose the same backoff
mini-slot, a collision will occur with RTS/CTS
transmissions and no users can grab the channel.
Once it has successfully grabbed the channel, the user
starts to transmit its data packet.

Suppose that $k_m$ users choose channel m to access.
Then the probability that user n (out of the $k_m$ users)
successfully grabs the channel m is

$$Pr_n = \Pr\{\min\{\lambda_1, \ldots, \lambda_{k_m}\} = \lambda_n\}
= \sum_{\lambda=1}^{\lambda_{\max}} \Pr\{\lambda_n = \lambda\} \Pr\{\min_{i \ne n}\{\lambda_i\} > \lambda \mid \lambda_n = \lambda\}
= \sum_{\lambda=1}^{\lambda_{\max}} \frac{1}{\lambda_{\max}} \left(\frac{\lambda_{\max} - \lambda}{\lambda_{\max}}\right)^{k_m - 1}. \tag{3}$$

For the ease of exposition, we focus on the asymptotic
case where $\lambda_{\max}$ goes to $\infty$. This is a good approximation
when the number of mini-slots $\lambda_{\max}$ for backoff is much
larger than the number of users N and collisions rarely
occur. It simplifies the analysis as

$$\lim_{\lambda_{\max} \to \infty} \sum_{\lambda=1}^{\lambda_{\max}} \frac{1}{\lambda_{\max}} \left(\frac{\lambda_{\max} - \lambda}{\lambda_{\max}}\right)^{k_m - 1} = \frac{1}{k_m}, \tag{4}$$

and thus the expected throughput of user n is

$$u_n(t) = \frac{B_m S_m(t)}{k_m}. \tag{5}$$
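The backoff analysis in (3) to (5) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names `grab_probability` and `expected_throughput` and the argument name `lam_max` (the number of mini-slots) are illustrative assumptions. It evaluates the finite sum in (3) exactly and lets one observe the approach to the 1/k_m limit in (4).

```python
from fractions import Fraction


def grab_probability(k_m, lam_max):
    """Probability that one given user out of k_m wins the channel, as in (3).

    The user wins when it draws backoff value lam and every other user draws
    a strictly larger value; equal draws collide and nobody wins.
    """
    total = Fraction(0)
    for lam in range(1, lam_max + 1):
        p_draw = Fraction(1, lam_max)                        # Pr{lambda_n = lam}
        p_others_later = Fraction(lam_max - lam, lam_max) ** (k_m - 1)
        total += p_draw * p_others_later
    return total


def expected_throughput(B_m, S_m, k_m):
    """Asymptotic expected per-user throughput (5): B_m * S_m(t) / k_m."""
    return B_m * S_m / k_m


if __name__ == "__main__":
    for lam_max in (10, 100, 1000):
        print(lam_max, float(grab_probability(3, lam_max)))  # approaches 1/3
```

With a single contending user the sum equals 1 for any number of mini-slots, and for larger k_m the gap to 1/k_m shrinks as the number of mini-slots grows, which is the regime the asymptotic analysis assumes.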



Fig. 2. Structure of each spectrum access time slot: spectrum sensing,
channel contention, data transmission, and channel recommendation
and selection.
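As a hedged illustration of the two-state channel model of this section, the sketch below iterates the update p_m(t) = p_m(t-1) Γ_m and compares the result with the closed-form stationary distribution in (1) and (2). The function names and the example values p = 0.2, q = 0.1 are assumptions made for illustration only.

```python
def stationary_distribution(p, q):
    """Closed form (1)-(2): (Pr{busy}, Pr{idle}) = (q, p) / (p + q)."""
    return q / (p + q), p / (p + q)


def propagate(prob, p, q, steps):
    """Apply p_m(t) = p_m(t-1) * Gamma_m for the given number of time slots.

    prob is the pair (Pr{S_m = 0}, Pr{S_m = 1}); p is the busy-to-idle
    probability and q is the idle-to-busy probability.
    """
    busy, idle = prob
    for _ in range(steps):
        busy, idle = busy * (1 - p) + idle * q, busy * p + idle * (1 - q)
    return busy, idle


if __name__ == "__main__":
    print(stationary_distribution(0.2, 0.1))
    print(propagate((0.0, 1.0), 0.2, 0.1, 500))
```

Starting from an idle channel, the iterated vector converges geometrically at rate |1 - p - q| to the stationary pair.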
III. Introduction To Channel Recommendation

In this section, we first give a review of the static channel
recommendation scheme in [5] and then discuss the motivation
for adaptive channel recommendation.



A. Review of Static Channel Recommendation 

The key idea of the static channel recommendation scheme
is that secondary users inform each other about the available
channels they have just accessed. More specifically, each secondary
user executes the following four stages synchronously
during each time slot (see Figure 2):

• Spectrum sensing: sense one of the channels based on the
channel selection result made at the end of the previous
time slot.

• Channel contention: if the channel sensing result is idle,
compete for the channel with the backoff mechanism
described in Section II.

• Data transmission: transmit data packets if the user
successfully grabs the channel.

• Channel recommendation and selection:

- Announce recommendation: if the user has successfully
accessed an idle channel, broadcast this channel
ID to all other secondary users.

- Collect recommendation: collect recommendations
from other secondary users and store them in a
buffer. Typically, the correlation of channel availabilities
between two slots diminishes as the time
difference increases. Therefore, each secondary user
will only keep the recommendations received from
the most recent W slots and discard the out-of-date
information. The user's own successful transmission
history within the W most recent time slots is also stored in
the buffer. W is a system design parameter and will
be further discussed later.

- Select channel: choose a channel to sense at the
next time slot by putting more weight on the recommended
channels according to a static branching
probability $P_{rec}$. Suppose that the user has $0 < R < M$
different channel recommendations in the buffer,
then the probability of accessing a channel m is

$$P_m = \begin{cases} \frac{P_{rec}}{R}, & \text{if channel } m \text{ is recommended}, \\ \frac{1 - P_{rec}}{M - R}, & \text{otherwise}. \end{cases} \tag{6}$$

A larger value of $P_{rec}$ means putting more
weight on the recommended channels. When $R = 0$
(no channel is recommended) or $R = M$ (all channels are
recommended), random access is used and the
probability of selecting channel m is $P_m = \frac{1}{M}$.
To illustrate the channel selection process, let us take the
network in Figure 1 as an example. Suppose that the branching
probability $P_{rec} = 0.4$. Since only $R = 1$ recommendation is
available (i.e., channel 4), the probabilities of choosing the
recommended channel 4 and any unrecommended channel are
$\frac{P_{rec}}{R} = 0.4$ and $\frac{1 - P_{rec}}{M - R} = 0.12$, respectively.
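The selection rule in (6) can be written as a short function. This is a sketch under stated assumptions: the function name `static_access_probabilities` is illustrative, channels are labelled 1 to M, and the value M = 6 in the usage example is an assumption inferred from the numbers 0.4 and 0.12 quoted above (it is the value for which (1 - 0.4)/(M - 1) = 0.12).

```python
def static_access_probabilities(M, recommended, p_rec):
    """Channel selection probabilities of the static scheme, following (6).

    M           -- number of channels, labelled 1..M
    recommended -- set of recommended channel IDs currently in the buffer
    p_rec       -- static branching probability P_rec
    """
    R = len(recommended)
    if R == 0 or R == M:
        # No usable preference information: plain random access.
        return {m: 1.0 / M for m in range(1, M + 1)}
    return {
        m: p_rec / R if m in recommended else (1.0 - p_rec) / (M - R)
        for m in range(1, M + 1)
    }


if __name__ == "__main__":
    probs = static_access_probabilities(6, {4}, 0.4)
    print(probs[4], probs[1])
```

The returned probabilities always sum to one, since the R recommended channels share P_rec and the remaining M - R channels share 1 - P_rec.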

Numerical studies in [5] showed that the static channel
recommendation scheme achieves higher performance than
the traditional random channel access scheme without information
exchange. However, the fixed value of $P_{rec}$ limits the
performance of the static scheme, as explained next.

B. Motivations For Adaptive Channel Recommendation 

The static channel recommendation mechanism is simple to
implement due to a fixed value of $P_{rec}$. However, it may lead
to significant congestion when the number of recommended
channels is small. In the extreme case when only $R = 1$
channel is recommended, (6) suggests that every
user will access that channel with a probability $P_{rec}$. When
the number of users N is large, the expected number of
users accessing this channel, $N P_{rec}$, will be high. Thus heavy
congestion happens and each secondary user will get a low
expected throughput.

A better way is to adaptively change the value of $P_{rec}$ based
on the number of recommended channels. This is the key
idea of our proposed algorithm. To illustrate the advantage
of adaptive algorithms, let us first consider a simple heuristic
adaptive algorithm in a homogeneous channel environment,
i.e., for each channel m, its data rate $B_m = B$ and channel
state changing probabilities $p_m = p$, $q_m = q$. In this algorithm,
we choose the branching probability such that the expected
number of secondary users choosing a single recommended
channel is one. To achieve this, we need to set $P_{rec}$ as in
Lemma 1.

Lemma 1. If we choose the branching probability $P_{rec} = \frac{R}{N}$,
then the expected number of secondary users choosing any
one of the R recommended channels is one.
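The statement of Lemma 1 follows from a one-line calculation: each of the N users picks a given recommended channel with probability P_rec / R, so the expected count is N P_rec / R. The sketch below, with illustrative function names that are not from the paper, checks this for N = 5.

```python
def heuristic_branching_probability(R, N):
    """Branching probability of the heuristic scheme: P_rec = R / N.

    Assumes 1 <= R <= N, which holds because at most N channels can be
    recommended by N users in one slot.
    """
    return R / N


def expected_users_per_recommended_channel(N, R, p_rec):
    """Each user picks a given recommended channel w.p. p_rec / R."""
    return N * p_rec / R


if __name__ == "__main__":
    N = 5
    for R in range(1, N + 1):
        p_rec = heuristic_branching_probability(R, N)
        print(R, p_rec, expected_users_per_recommended_channel(N, R, p_rec))
```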

Due to space limitations, we give the detailed proof of
Lemma 1 in [?]. Without going through the detailed analysis, it is
straightforward to show the benefit of such an adaptive approach
through simple numerical examples. Let us consider a network
with M = 10 channels and N = 5 secondary users. For
each channel m, the initial channel state probability vector is
$p_m(0) = (0, 1)$ and the transition matrix is

$$\Gamma_m = \begin{bmatrix} 1 - 0.01\epsilon & 0.01\epsilon \\ 0.01\epsilon & 1 - 0.01\epsilon \end{bmatrix},$$

where $\epsilon$ is called the dynamic factor. A larger value of
$\epsilon$ implies that the channels are more dynamic over time.
We are interested in the time average system throughput
$U = \frac{1}{T} \sum_{t=1}^{T} \sum_{n=1}^{N} u_n(t)$, where $u_n(t)$ is the throughput of user n at
time slot t. In the simulation, we set the total number of time
slots T = 2000.

We implement the following three channel access schemes:

• Random access scheme: each secondary user selects a
channel randomly.

• Static channel recommendation scheme as in [5] with the
optimal constant branching probability $P_{rec} = 0.7$.

• Heuristic adaptive channel recommendation scheme with
the variable branching probability $P_{rec} = \frac{R}{N}$.

Figure 4 shows that the heuristic adaptive channel recommendation
scheme outperforms the static channel recommendation
scheme, which in turn outperforms the random access
scheme. Moreover, the heuristic adaptive scheme is more
robust to the dynamic channel environment, as its throughput
decreases more slowly than that of the static scheme when
$\epsilon$ increases.

Fig. 4. Comparison of three channel access schemes

We can imagine that an optimal adaptive scheme (by setting
the right $P_{rec}(t)$ over time) can further increase the network
performance. However, computing the optimal branching
probability in closed-form is very difficult. In the rest of the
paper, we will focus on characterizing the structures of the
optimal spectrum access strategy and designing an efficient
algorithm to achieve the optimum.
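The comparison of the three schemes can be explored with a small simulation. The following is a simplified sketch rather than the authors' simulator: it assumes buffer size W = 1, homogeneous channels with p = q = 0.01ε, zero-indexed channels, and the asymptotic backoff model in which exactly one contending user wins each idle channel. The function name `simulate` and all default arguments are illustrative.

```python
import random


def simulate(scheme, M=10, N=5, eps=1.0, T=2000, B=1.0, p_rec=0.7, seed=0):
    """Time-average system throughput of one access scheme (buffer size W = 1).

    scheme is "random", "static" or "heuristic".
    """
    rng = random.Random(seed)
    flip = 0.01 * eps
    idle = [True] * M                  # initial state vector (0, 1): all idle
    recommended = set()                # channels recommended in the last slot
    total = 0.0
    for _ in range(T):
        # Channel state transition: each channel flips with probability flip.
        idle = [s != (rng.random() < flip) for s in idle]
        R = len(recommended)
        if scheme == "random" or R in (0, M):
            branch = None              # uniform random access
        elif scheme == "static":
            branch = p_rec
        else:
            branch = R / N             # heuristic of Lemma 1
        rec = sorted(recommended)
        unrec = [m for m in range(M) if m not in recommended]
        chosen = set()
        for _ in range(N):
            if branch is None:
                chosen.add(rng.randrange(M))
            elif rng.random() < branch:
                chosen.add(rng.choice(rec))
            else:
                chosen.add(rng.choice(unrec))
        # Every idle channel picked by at least one user carries one
        # successful transmission and is recommended for the next slot.
        recommended = {m for m in chosen if idle[m]}
        total += B * len(recommended)
    return total / T


if __name__ == "__main__":
    for name in ("random", "static", "heuristic"):
        print(name, simulate(name))
```

Sweeping `eps` over a range of values gives a rough counterpart of the throughput-versus-dynamic-factor comparison; the exact numbers depend on the seed and on the simplifications listed above.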

IV. Adaptive Channel Recommendation Scheme 

We first study the optimal channel recommendation in the
homogeneous channel environment, i.e., each channel m has
the same data rate $B_m = B$ and identical channel state
changing probabilities $p_m = p$, $q_m = q$. The generalization
to the heterogeneous channel setting will be discussed in
Section VI. To find the optimal adaptive spectrum access
strategy, we formulate the system as a Markov Decision
Process (MDP). For the sake of simplicity, we assume that the
recommendation buffer size W = 1, i.e., users only consider
the recommendations received in the last time slot. Our method
also applies to the case when W > 1 by using a high-order
MDP formulation, although the analysis is more involved.



A. MDP Formulation For Adaptive Channel Recommendation

We model the system as an MDP as follows:

• System state: $R \in \mathcal{R} = \{0, 1, \ldots, \min\{M, N\}\}$ denotes
the number of recommended channels at the end of time
slot t. Since we assume that all channels are statistically
identical, there is no need to keep track of the
recommended channel IDs^2.

• Action: $P_{rec} \in \mathcal{P} = (0, 1)$ denotes the branching
probability of choosing the set of recommended channels.

• Transition probability: The probability that action $P_{rec}$ in
system state R in time slot t will lead to system state R'
in the next time slot is

$$P^{P_{rec}}_{R,R'} = \Pr\{R(t+1) = R' \mid R(t) = R, P_{rec}(t) = P_{rec}\}.$$

We can compute this probability as in (7), with detailed
derivations given in Appendix C.

• Reward: $U(R, P_{rec})$ is the expected system throughput
in the next time slot when the action $P_{rec}$ is taken under
the current system state R, i.e.,

$$U(R, P_{rec}) = \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'} U_{R'},$$

where $U_{R'}$ is the system throughput in state R'. If R' idle
channels are utilized by the secondary users in a time slot,
then these R' channels will be recommended at the end
of the time slot. Thus, we have

$$U_{R'} = R'B.$$

Recall that B is the data rate that a single user can obtain
on an idle channel.

• Stationary policy: $\pi \in \Omega = \mathcal{P}^{|\mathcal{R}|}$ maps each state R to
an action $P_{rec}$, i.e., $\pi(R)$ is the action $P_{rec}$ taken when
the system is in state R. The mapping is stationary and
does not depend on time t.
Given a stationary policy $\pi$ and the initial state $R_0 \in \mathcal{R}$,
we define the network's value function as the time average
system throughput, i.e.,

$$\Phi_\pi(R_0) = \lim_{T \to \infty} \frac{1}{T} E_\pi\left[\sum_{t=0}^{T-1} U(R(t), \pi(R(t)))\right].$$

We want to find an optimal stationary policy $\pi^*$ that maximizes
the value function $\Phi_\pi(R_0)$ for any initial state $R_0$, i.e.,

$$\pi^* = \arg\max_{\pi} \Phi_\pi(R_0), \quad \forall R_0 \in \mathcal{R}.$$

Notice that this is a system-wide optimization, although the
optimal solution can be implemented in a distributed fashion.
This is because every user knows the number of recommended
channels R, and it can determine the same optimal access
probability locally. For example, each user can calculate the
optimal spectrum access policy off-line, and determine the
real-time optimal channel access probability $P_{rec}$ locally by
observing the number of recommended channels R after
entering the network.

^2 Users need to know the IDs of the recommended channels in order to
access them. However, the IDs are not important in terms of MDP analysis.
[Equation (7): closed-form expression of the transition probability
$P^{P_{rec}}_{R,R'}$, with the derivation given in the Appendix.]



B. Existence of Optimal Stationary Policy 

The MDP formulation above is an average reward based MDP.
We can prove that an optimal stationary policy that is independent
of the initial system state always exists in our MDP
formulation. The proof relies on the following lemma from
[7].

Lemma 2. If the state space is finite and every stationary 
policy leads to an irreducible Markov chain, then there exists a 
stationary policy that is optimal for the average reward based 
MDP. 

The irreducibility of a Markov chain means that it is possible
to get to any state from any state. For the adaptive channel
recommendation scheme, we have

Lemma 3. Given a stationary policy $\pi$ for the adaptive
channel recommendation MDP, the resulting Markov chain is
irreducible.

Proof: We consider the following two cases:

Case I, when $0 < q < 1$: since $0 < P_{rec} < 1$, $0 < p < 1$,
and $0 < q < 1$, we can verify that given any state R, the
transition probability $P^{\pi(R)}_{R,R'} > 0$ for all $R' \in \mathcal{R}$. Thus, any
two states communicate with each other.

Case II, when $q = 1$: for all $R \in \mathcal{R}$, the transition
probability $P^{\pi(R)}_{R,R'} > 0$ if $R' \in \{0, \ldots, \min\{M - R, N\}\}$. It
follows that the state $R' = 0$ is accessible from any other
state $R \in \mathcal{R}$. By setting $R = 0$, we see that $P^{\pi(0)}_{0,R'} > 0$ for
all $R' \in \{0, \ldots, \min\{M, N\}\}$. That is, any other state $R' \in \mathcal{R}$
is also accessible from the state $R = 0$. Thus, any two states
communicate with each other.

Since any two states communicate with each other in all
cases and the number of system states $|\mathcal{R}|$ is finite, the resulting
Markov chain is irreducible. ■

Combining Lemmas 2 and 3, we have

Theorem 1. There exists an optimal stationary policy for the
adaptive channel recommendation MDP.

Furthermore, the irreducibility of the adaptive channel recommendation
MDP also implies that the optimal stationary
policy $\pi^*$ is independent of the initial state $R_0$ [7], i.e.,

$$\Phi_{\pi^*}(R_0) = \Phi_{\pi^*}, \quad \forall R_0 \in \mathcal{R},$$

where $\Phi_{\pi^*}$ is the maximum time average system throughput.
In the rest of the paper, we will just use "optimal policy"
to refer to the "optimal stationary policy that is independent of the
initial system state".



C. Structure of Optimal Stationary Policy 

Next we characterize the structure of the optimal policy
without using the closed-form expressions of the policy (which
are generally hard to obtain). The key idea is to treat the
average reward based MDP as the limit of a sequence of
discounted reward MDPs with discount factors going to
one. Under the irreducibility condition, the average reward
based MDP thus inherits the structure property from the
corresponding discounted reward MDP [8]. We can write down
the Bellman equations of the discounted version of our MDP
problem as:

$$V_t(R) = \max_{P_{rec} \in \mathcal{P}} \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'} [U_{R'} + \beta V_{t+1}(R')], \quad \forall R \in \mathcal{R}, \tag{8}$$

where $V_t(R)$ is the discounted maximum expected system
throughput starting from time slot t when the system is in state
R, and $\beta$ is the discount factor.

Due to the combinatorial complexity of the transition probability
$P^{P_{rec}}_{R,R'}$ in (7), it is difficult to obtain the structure results
for the general case. We further limit our attention to the
following two asymptotic cases.

1) Case One, the number of channels M goes to infinity
while the number of users N stays finite: In this case,
the number of channels is much larger than the number of
secondary users, and thus heavy congestion rarely happens
on any channel. Thus it is safe to emphasize accessing
the recommended channels. Before proving the main result of
Case One in Theorem 2, let us first characterize the property
of the discounted maximum expected system payoff $V_t(R)$.

Proposition 1. When $M = \infty$ and $N < \infty$, the value
function $V_t(R)$ for the discounted adaptive channel recommendation
MDP is nondecreasing in R.

The proof of Proposition 1 is given in the Appendix. Based
on the monotone property of the value function $V_t(R)$, we
prove the following main result.

Theorem 2. When $M = \infty$ and $N < \infty$, for the adaptive
channel recommendation MDP, the optimal stationary policy
$\pi^*$ is monotone, that is, $\pi^*(R)$ is nondecreasing in $R \in \mathcal{R}$.

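Proof: For the ease of discussion, we define

$$Q_t(R, P_{rec}) = \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'} [U_{R'} + \beta V_{t+1}(R')],$$

with the partial cross derivative being

$$\frac{\partial^2 Q_t(R, P_{rec})}{\partial R \, \partial P_{rec}} = \frac{\partial \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R+1,R'} [U_{R'} + \beta V_{t+1}(R')]}{\partial P_{rec}} - \frac{\partial \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'} [U_{R'} + \beta V_{t+1}(R')]}{\partial P_{rec}}.$$

By Lemma 6 in the Appendix, we know the reverse cumulative
distribution function $\sum_{R' \ge r} P^{P_{rec}}_{R,R'}$ is supermodular on $\mathcal{R} \times \mathcal{P}$
for every threshold r. It implies

$$\frac{\partial \sum_{R' \ge r} P^{P_{rec}}_{R+1,R'}}{\partial P_{rec}} - \frac{\partial \sum_{R' \ge r} P^{P_{rec}}_{R,R'}}{\partial P_{rec}} \ge 0.$$

Since $V_{t+1}(R')$ is nondecreasing in R' by Proposition 1
and $U_{R'} = R'B$, we know that $U_{R'} + \beta V_{t+1}(R')$ is also
nondecreasing in R'. Then we have

$$\frac{\partial \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R+1,R'} [U_{R'} + \beta V_{t+1}(R')]}{\partial P_{rec}} \ge \frac{\partial \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'} [U_{R'} + \beta V_{t+1}(R')]}{\partial P_{rec}},$$

i.e.,

$$\frac{\partial^2 Q_t(R, P_{rec})}{\partial R \, \partial P_{rec}} \ge 0,$$

which implies that $Q_t(R, P_{rec})$ is supermodular on $\mathcal{R} \times \mathcal{P}$.
Since

$$\pi^*(R) = \arg\max_{P_{rec}} Q_t(R, P_{rec}),$$

by the property of supermodularity, the optimal policy $\pi^*(R)$
is nondecreasing in R for the discounted MDP above. Since
the average reward based MDP inherits its structure property,
this result is also true for the adaptive channel recommendation
MDP. ■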
2) Case Two, the number of users N goes to infinity while
the number of channels M stays finite: In this case, the
number of secondary users is much larger than the number
of channels, and thus congestion becomes a major concern.
However, since there are infinitely many secondary users, all
the idle channels at each time slot can be utilized as long
as users have positive probabilities to access all channels.
From the system's point of view, the cognitive radio network
operates in the saturation state. Formally, we show that

Theorem 3. When $N = \infty$ and $M < \infty$, for the adaptive
channel recommendation MDP, any stationary policy
$\pi$ satisfying

$$0 < \pi(R) < 1, \quad \forall R \in \mathcal{R},$$

is optimal.

Proof: We first define the sets of policies
$\Lambda = \{\pi : 0 < \pi(R) < 1, \forall R \in \mathcal{R}\}$ and $\Lambda^c = \Omega \setminus \Lambda$. Recall that the
value of $\pi(R)$ equals the probability of choosing the set of
recommended channels, i.e., $P_{rec}$.

Then it is easy to check that the probability of accessing
an arbitrary channel m is positive under any policy $\pi \in \Lambda$.
Since the number of secondary users $N = \infty$, it implies that
all the channels will be accessed by the secondary users. In
this case, the transition probability from a system state R to
R' of the resulting Markov chain is given by

$$P^{\pi(R)}_{R,R'} = \sum_{m_r + m_u = R',\; m_r \le R,\; m_u \le M - R} \binom{R}{m_r} (1-q)^{m_r} q^{R - m_r} \binom{M-R}{m_u} \left(\frac{p}{p+q}\right)^{m_u} \left(\frac{q}{p+q}\right)^{M-R-m_u}, \tag{9}$$

which is independent of the branching probability $\pi(R)$. It
implies that any policy $\pi \in \Lambda$ leads to a Markov chain with the
same transition probabilities $P^{\pi(R)}_{R,R'}$. Thus, any policy $\pi \in \Lambda$
offers the same time average system throughput.

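We next show that any policy $\pi' \in \Lambda^c$ leads to a payoff
no better than the payoff of a policy $\pi \in \Lambda$. For a policy $\pi'$
where there exist some states R such that $\pi'(R) = 0$, the
transition probability from the system state R to R' is

$$P^{\pi'(R)}_{R,R'} = \begin{cases} \binom{M-R}{R'} \left(\frac{p}{p+q}\right)^{R'} \left(\frac{q}{p+q}\right)^{M-R-R'}, & \text{if } R' \le M - R, \\ 0, & \text{if } R' > M - R. \end{cases}$$

If there exist some states R such that $\pi'(R) = 1$, we have
the transition probability as

$$P^{\pi'(R)}_{R,R'} = \begin{cases} \binom{R}{R'} (1-q)^{R'} q^{R-R'}, & \text{if } R' \le R, \\ 0, & \text{if } R' > R. \end{cases}$$

Each of these two distributions retains only one of the two binomial
terms whose sum gives (9). Compared with (9), we therefore have

$$\sum_{R'=i}^{M} P^{\pi(R)}_{R,R'} \ge \sum_{R'=i}^{M} P^{\pi'(R)}_{R,R'}, \quad \forall i \in \mathcal{R}, \pi \in \Lambda, \pi' \in \Lambda^c.$$

Suppose that the time horizon consists of any T time slots,
and $V^{\pi}_t(R)$ denotes the expected system throughput under the
policy $\pi$ starting from time slot t when the system is in state
R.

When $t = T$,

$$V^{\pi}_T(R) = V^{\pi'}_T(R) = U_R = RB, \quad \forall R \in \mathcal{R}, \pi \in \Lambda, \pi' \in \Lambda^c.$$

It follows that $U_R + \beta V^{\pi}_T(R) = U_R + \beta V^{\pi'}_T(R)$, which is
nondecreasing in R, and hence

$$\sum_{R'=0}^{M} P^{\pi(R)}_{R,R'} [U_{R'} + \beta V^{\pi}_T(R')] \ge \sum_{R'=0}^{M} P^{\pi'(R)}_{R,R'} [U_{R'} + \beta V^{\pi'}_T(R')],$$

i.e.,

$$V^{\pi}_{T-1}(R) \ge V^{\pi'}_{T-1}(R), \quad \forall R \in \mathcal{R}, \pi \in \Lambda, \pi' \in \Lambda^c.$$

Recursively, for any time slot $t \le T$, we can show that

$$V^{\pi}_t(R) \ge V^{\pi'}_t(R), \quad \forall R \in \mathcal{R}, \pi \in \Lambda, \pi' \in \Lambda^c.$$

Thus, if there exists a policy $\pi' \in \Lambda^c$ that is optimal, then all
the policies $\pi \in \Lambda$ are also optimal. If there does not exist such
a policy $\pi'$, then we conclude that only the policies $\pi \in \Lambda$ are
optimal. ■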
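The dominance argument in the proof can be checked numerically. The sketch below assumes the reading of (9) as the distribution of the sum of two independent binomial counts (recommended channels that stay idle with probability 1 - q, and unrecommended channels that are idle with probability p/(p+q)); the function names and the example values M = 4, p = 0.3, q = 0.2 are illustrative assumptions.

```python
from math import comb


def binom_pmf(n, k, prob):
    """Binomial probability of k successes out of n trials."""
    return comb(n, k) * prob ** k * (1 - prob) ** (n - k)


def saturated_row(R, M, p, q):
    """Row R of the transition matrix in (9): distribution of R' = m_r + m_u."""
    stay_idle = 1 - q            # a recommended channel stays idle
    idle_stat = p / (p + q)      # an unrecommended channel is idle
    row = [0.0] * (M + 1)
    for m_r in range(R + 1):
        for m_u in range(M - R + 1):
            row[m_r + m_u] += (binom_pmf(R, m_r, stay_idle)
                               * binom_pmf(M - R, m_u, idle_stat))
    return row


def boundary_row(R, M, p, q, action):
    """Row R when pi'(R) = 0 (unrecommended only) or pi'(R) = 1 (recommended only)."""
    if action == 0:
        n, prob = M - R, p / (p + q)
    else:
        n, prob = R, 1 - q
    return [binom_pmf(n, k, prob) if k <= n else 0.0 for k in range(M + 1)]


def tail(row, i):
    """Reverse cumulative probability Pr{R' >= i}."""
    return sum(row[i:])


if __name__ == "__main__":
    M, p, q = 4, 0.3, 0.2
    for R in range(M + 1):
        print(R, [round(tail(saturated_row(R, M, p, q), i), 3) for i in range(M + 1)])
```

For every state R and every threshold i, the tail probability under (9) is at least as large as under either boundary action, which is the stochastic dominance used to compare policies in Λ with those in its complement.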

V. Model Reference Adaptive Search For Optimal 
Spectrum Access Policy 

Next we will design an algorithm that can converge to the
optimal policy under general system parameters (not limited
to the two asymptotic cases). Since the action space of the
adaptive channel recommendation MDP is continuous (i.e.,
choosing a probability $P_{rec}$ in (0, 1)), the traditional method
of discretizing the action space followed by policy iteration, value
iteration, or Q-learning cannot be guaranteed to converge to the
optimal policy. To overcome this difficulty, we propose a
new algorithm developed from the Model Reference Adaptive
Search method, which was recently developed in the Operations
Research community [6]. We will show that the proposed
algorithm is easy to implement and is provably convergent to
the optimal policy.



A. Model Reference Adaptive Search Method

We first introduce the basic idea of the Model Reference
Adaptive Search (MRAS) method. Later on, we will show
how the method can be used to obtain the optimal spectrum access
policy for our problem.

The MRAS method is a new randomized method for global
optimization [6]. The key idea is to randomize the original
optimization problem over the feasible region according to
a specified probabilistic model. The method then generates
candidate solutions and updates the probabilistic model on the
basis of elite solutions and a reference model, so as to guide
the future search toward better solutions.

Formally, let $J(x)$ be the objective function to maximize.
The MRAS method is an iterative algorithm, and it includes
three phases in each iteration k:

• Random solution generation: generate a set of random
solutions $\{x\}$ in the feasible set $\chi$ according to a
parameterized probabilistic model $f(x, v_k)$, which is a
probability density function (pdf) with parameter $v_k$.
The number of solutions to generate is a fixed system
parameter.

• Reference distribution construction: select elite solutions
among the randomly generated set in the previous phase,
such that the chosen ones satisfy $J(x) \ge \gamma$. Construct a
reference probability distribution as

[Equation (10): the reference distribution $g_k(x)$, built from the
indicator $I_{\{J(x) \ge \gamma\}}$ together with $f(x, v_0)$ when $k = 1$ and
with $g_{k-1}(x)$ when $k \ge 2$.]

where $I_{\{\varpi\}}$ is an indicator function, which equals 1 if
the event $\varpi$ is true and zero otherwise. Parameter $v_0$
is the initial parameter for the probabilistic model (used
during the first iteration, i.e., $k = 1$), and $g_{k-1}(x)$ is the
reference distribution in the previous iteration (used when
$k \ge 2$).

• Probabilistic model update: update the parameter v of the
probabilistic model $f(x, v)$ by minimizing the Kullback-Leibler
divergence between $g_k(x)$ and $f(x, v)$, i.e.,

$$v_{k+1} = \arg\min_{v} E_{g_k}\left[\ln \frac{g_k(x)}{f(x, v)}\right]. \tag{11}$$

By constructing the reference distribution according to (10),
the expected performance of random elite solutions can be
improved under the new reference distribution, i.e.,

[Equation (12): the expected performance of the elite solutions
under $g_k$ is no smaller than that under $g_{k-1}$.]

To find a better solution to the optimization problem, it is
natural to update the probabilistic model (from which random
solutions are generated in the first stage) as close to the new
reference probability as possible, as done in the third stage.
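To make the three phases concrete, the following is a deliberately simplified sketch of an elite-sample search with a one-dimensional Gaussian model. It is not the MRAS method itself: it omits the reference-distribution weighting of (10) and simply refits the Gaussian to the elite samples, which is closer to the cross-entropy method. The function name, the quantile-based choice of the threshold γ, and all default parameters are assumptions made for illustration.

```python
import random
import statistics


def elite_gaussian_search(J, mu0, sigma0, n_samples=200, elite_frac=0.1,
                          n_iters=30, seed=0):
    """Maximize J over the real line with a Gaussian sampling model."""
    rng = random.Random(seed)
    mu, sigma = mu0, sigma0
    for _ in range(n_iters):
        # Phase 1: random solution generation from f(x, v_k) = N(mu, sigma^2).
        xs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Phase 2: elite selection, keeping the samples with J(x) >= gamma.
        scores = sorted((J(x) for x in xs), reverse=True)
        gamma = scores[max(1, int(elite_frac * n_samples)) - 1]
        elite = [x for x in xs if J(x) >= gamma]
        # Phase 3: model update, here a plain refit to the elite samples.
        mu = statistics.fmean(elite)
        sigma = max(statistics.pstdev(elite), 1e-6)
    return mu


if __name__ == "__main__":
    print(elite_gaussian_search(lambda x: -(x - 0.3) ** 2, mu0=0.0, sigma0=1.0))
```

For a Gaussian family, minimizing the Kullback-Leibler divergence in (11) against an empirical elite distribution reduces to matching the sample mean and variance, which is why the refit in phase 3 is a reasonable stand-in for illustration.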



B. Model Reference Adaptive Search For Optimal Spectrum 
Access Policy 

In this section, we design an algorithm based on the MRAS
method to find the optimal spectrum access policy. Here we
treat the adaptive channel recommendation MDP as a global
optimization problem over the policy space. The key challenge
is the choice of a proper probabilistic model $f(\cdot)$, which is
crucial for the convergence of the MRAS algorithm.

1) Random Policy Generation: To apply the MRAS 
method, we first need to set up a random policy generation 
mechanism. Since the action space of the channel recommen- 
dation MDP is continuous, we use the Gaussian distributions. 
Specifically, we generate sample actions 7r(i?) from a Gaussian 
distribution for each system state R E TZ independently, i.e. 
7r(_R) ~ A/'(/^ij, cr^jll In this case, a candidate policy tt can 
be generated from the joint distribution of \TZ\ independent 
Gaussian distributions, i.e., 

(^(0),...,7r(miii{A/,iV})) ^ ^^{^lo,al) x ■ ■ ■ 

xA/'(/imin{Jl/,Af}, fmin{Jl/,Af})- 

As shown later, Gaussian distribution has nice analytical and 
convergent properties for the MRAS method. 

For the sake of brevity, we denote $f(\pi(R), \mu_R, \sigma_R)$ as the
pdf of the Gaussian distribution $\mathcal{N}(\mu_R, \sigma_R^2)$, and $f(\pi, \mu, \sigma)$
as the random policy generation mechanism with parameters
$\mu = (\mu_0, \ldots, \mu_{\min\{M,N\}})$ and $\sigma = (\sigma_0, \ldots, \sigma_{\min\{M,N\}})$, i.e.,

$$
f(\pi, \mu, \sigma) = \prod_{R=0}^{\min\{M,N\}} f(\pi(R), \mu_R, \sigma_R)
= \prod_{R=0}^{\min\{M,N\}} \frac{1}{\sqrt{2\psi\sigma_R^2}}\, e^{-\frac{(\pi(R) - \mu_R)^2}{2\sigma_R^2}},
$$

where $\psi$ is the circumference-to-diameter ratio.

¹Note that the Gaussian distribution has a support over $(-\infty, +\infty)$, which
is larger than the feasible region of $\pi(R)$. This issue is handled in the
system throughput evaluation step that follows.
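The random policy generation step can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the choice of six system states, and the use of NumPy are assumptions made here; the feasible region $(0, 1)$ and the initial values $\mu_R = \sigma_R = 0.5$ follow the simulation setup described later in the paper.

```python
import numpy as np

def sample_policies(mu, sigma, num_policies, rng):
    """Draw candidate policies; row i holds pi_i(0), ..., pi_i(min{M,N})."""
    return rng.normal(loc=mu, scale=sigma, size=(num_policies, len(mu)))

def log_pdf(policies, mu, sigma):
    """Log of the product-form Gaussian pdf f(pi, mu, sigma), one value per policy."""
    z = (policies - mu) / sigma
    per_state = -0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
    return per_state.sum(axis=1)

def is_feasible(policies, low=0.0, high=1.0):
    """A policy is feasible when every action lies in the open interval (low, high)."""
    return np.all((policies > low) & (policies < high), axis=1)

rng = np.random.default_rng(0)
mu = np.full(6, 0.5)      # hypothetical: six system states R = 0, ..., 5
sigma = np.full(6, 0.5)
policies = sample_policies(mu, sigma, 500, rng)
feasible = is_feasible(policies)
```

Working with the log-pdf rather than the pdf itself avoids underflow when the product over system states becomes very small.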

2) System Throughput Evaluation: Given a candidate policy
$\pi$ randomly generated based on $f(\pi, \mu, \sigma)$, we need
to evaluate the expected system throughput $\Phi_\pi$. From (7),
we obtain the transition probabilities $P^{\pi(R)}_{R,R'}$ for any system
states $R, R' \in \mathcal{R}$. Since a policy $\pi$ leads to a finite
irreducible Markov chain, we can obtain its stationary distribution.
Let us denote the transition matrix of the Markov chain
as $Q = [P^{\pi(R)}_{R,R'}]_{|\mathcal{R}| \times |\mathcal{R}|}$ and the stationary distribution as
$p = (\Pr(0), \ldots, \Pr(\min\{M, N\}))$. The stationary
distribution can be obtained by solving the equation

$$
pQ = p.
$$

We then calculate the expected system throughput by

$$
\Phi_\pi = \sum_{R \in \mathcal{R}} \Pr(R)\, U_R.
$$

Note that in the discussion above, we assume that $\pi \in \Omega$
implicitly, where $\Omega$ is the feasible policy space. Since the Gaussian
distribution has a support over $(-\infty, +\infty)$, we
extend the definition of expected system throughput over
$(-\infty, +\infty)^{|\mathcal{R}|}$ as

$$
\Phi_\pi = \begin{cases}
\sum_{R \in \mathcal{R}} \Pr(R)\, U_R, & \pi \in \Omega, \\
-\infty, & \text{otherwise}.
\end{cases}
$$

In this case, whenever a generated policy $\pi$ is not feasible,
we have $\Phi_\pi = -\infty$. As a result, such a policy $\pi$ will not be
selected as an elite sample (discussed next) and will not be used
for probability updating. Hence the search of the MRAS algorithm
will not be biased towards any infeasible policy space.
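The evaluation step above can be sketched as follows. This is an illustrative sketch under stated assumptions: the transition matrix is passed in directly (in the paper it is built from the policy via (7), which is not reproduced here), the toy two-state matrix and reward vector are hypothetical, and the helper names are not from the paper. The stationary distribution is found by stacking the balance equations $pQ = p$ with the normalization $\sum_R \Pr(R) = 1$.

```python
import numpy as np

def stationary_distribution(Q):
    """Solve p Q = p together with sum(p) = 1 for a finite irreducible chain."""
    n = Q.shape[0]
    A = np.vstack([Q.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

def expected_throughput(policy, Q, rewards, low=0.0, high=1.0):
    """Return sum_R Pr(R) * U_R, or -inf when the policy is infeasible."""
    policy = np.asarray(policy)
    if not np.all((policy > low) & (policy < high)):
        return -np.inf
    return float(stationary_distribution(Q) @ rewards)

# Hypothetical two-state example; rewards follow U_R = R * B with B = 1.
Q_toy = np.array([[0.9, 0.1],
                  [0.5, 0.5]])
rewards_toy = np.array([0.0, 1.0])
phi = expected_throughput([0.4, 0.6], Q_toy, rewards_toy)
```

For the toy chain the stationary distribution is $(5/6, 1/6)$, so the sketch returns a throughput of $1/6$; an infeasible policy such as $(1.2, 0.6)$ returns $-\infty$ and can never become an elite sample.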

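3) Reference Distribution Construction: To construct the
reference distribution, we first need to select the elite policies.
Suppose $L$ candidate policies $\pi_1, \pi_2, \ldots, \pi_L$ are generated at
each iteration. We sort them in increasing order of
the expected system throughput, i.e., $\Phi_{\pi_1} \le \Phi_{\pi_2} \le \cdots \le \Phi_{\pi_L}$,
and set the elite threshold as

$$
\gamma = \Phi_{\pi_{\lceil (1-\rho)L \rceil}},
$$

where $0 < \rho < 1$ is the elite ratio. For example, when $L = 100$
and $\rho = 0.4$, then $\gamma = \Phi_{\pi_{60}}$ and the last 40 samples in the
sequence will be selected as elite samples. Note that as long
as $L$ is sufficiently large, we shall have $\gamma > -\infty$ and hence
only feasible policies $\pi$ are selected. According to (10), we
then construct the reference distribution as

$$
g_k(\pi) = \begin{cases}
\dfrac{I_{\{\Phi_\pi > \gamma\}}}{\int_{\pi \in \Omega} I_{\{\Phi_\pi > \gamma\}}\, d\pi}, & k = 1, \\[2ex]
\dfrac{e^{\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\, g_{k-1}(\pi)}{E_{g_{k-1}}\left[e^{\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\right]}, & k \ge 2.
\end{cases} \tag{13}
$$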
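The elite selection rule can be checked on the worked example from the text ($L = 100$, elite ratio $0.4$). The sketch below is illustrative only; the function name and the synthetic throughput values $1, 2, \ldots, 100$ are assumptions made here.

```python
import math
import numpy as np

def elite_threshold(throughputs, rho):
    """Return the ceil((1 - rho) * L)-th smallest throughput (1-indexed)."""
    ordered = np.sort(throughputs)
    L = len(ordered)
    # round() guards against floating-point noise before taking the ceiling
    idx = math.ceil(round((1.0 - rho) * L, 9))
    return ordered[idx - 1]

throughputs = np.arange(1.0, 101.0)        # stand-ins for Phi_{pi_1}, ..., Phi_{pi_100}
gamma = elite_threshold(throughputs, rho=0.4)
elite = throughputs[throughputs > gamma]   # strict inequality, as in the text
```

With these inputs the threshold is the 60th value and exactly the top 40 samples pass the strict test against the threshold.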



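4) Policy Generation Update: For the MRAS algorithm,
the critical issue is the updating of the random policy generation
mechanism $f(\pi, \mu, \sigma)$, i.e., solving the problem in (11). The
optimal update rule is described as follows.

Theorem 4. The optimal parameter $(\mu, \sigma)$ that minimizes the
Kullback-Leibler divergence between the reference distribution
$g_k(\pi)$ in (13) and the new policy generation mechanism
$f(\pi, \mu, \sigma)$ is

$$
\mu_R = \frac{\int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\, \pi(R)\, d\pi}{\int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\, d\pi}, \quad \forall R \in \mathcal{R}, \tag{14}
$$

$$
\sigma_R^2 = \frac{\int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\, [\pi(R) - \mu_R]^2\, d\pi}{\int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\, d\pi}, \quad \forall R \in \mathcal{R}. \tag{15}
$$

Proof: First, from (13), we have

$$
g_1(\pi) = \frac{I_{\{\Phi_\pi > \gamma\}}}{\int_{\pi \in \Omega} I_{\{\Phi_\pi > \gamma\}}\, d\pi},
$$

and

$$
g_2(\pi) = \frac{e^{\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\, g_1(\pi)}{E_{g_1}\left[e^{\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\right]}
= \frac{e^{\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}}{\int_{\pi \in \Omega} e^{\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\, d\pi}.
$$

Repeating the above computation iteratively, we have

$$
g_k(\pi) = \frac{e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}}{\int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}}\, d\pi}, \quad k \ge 1. \tag{16}
$$

Then, the problem in (11) is equivalent to solving

$$
\max_{\mu, \sigma} \int_{\pi \in \Omega} g_k(\pi) \ln f(\pi, \mu, \sigma)\, d\pi, \quad \text{subject to } \mu, \sigma \succeq 0. \tag{17}
$$

Substituting (16) into (17) and dropping the normalizing constant,
which does not depend on $(\mu, \sigma)$, we have

$$
\max_{\mu, \sigma} \int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}} \ln f(\pi, \mu, \sigma)\, d\pi, \quad \text{subject to } \mu, \sigma \succeq 0.
$$

Function $f(\pi(R), \mu_R, \sigma_R)$ is log-concave, since it is
the pdf of the Gaussian distribution. Since log-concavity
is closed under multiplication, $f(\pi, \mu, \sigma) =
\prod_{R=0}^{\min\{M,N\}} f(\pi(R), \mu_R, \sigma_R)$ is also log-concave. It implies
that the problem in (11) is a concave optimization problem. Solving
by the first order condition, we have

$$
\frac{\partial \int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}} \ln f(\pi, \mu, \sigma)\, d\pi}{\partial \mu_R} = 0, \quad \forall R \in \mathcal{R},
$$

$$
\frac{\partial \int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi > \gamma\}} \ln f(\pi, \mu, \sigma)\, d\pi}{\partial \sigma_R} = 0, \quad \forall R \in \mathcal{R},
$$

which leads to (14) and (15). Due to the concavity of the
optimization problem in (17), the solution is also the global
optimum for the random policy generation updating. ■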
5) MRAS Algorithm For Optimal Spectrum Access Policy:
Based on the MRAS algorithm, we generate $L$ candidate
policies at each iteration. Then the updates in (14) and (15)
are replaced by the sample average versions in (21) and (22),
respectively. In summary, we describe the MRAS-based
algorithm for finding the optimal spectrum access policy of the
adaptive channel recommendation MDP in Algorithm 1.

C. Convergence of Model Reference Adaptive Search 

In this part, we discuss the convergence property of the
MRAS-based optimal spectrum access policy. For ease of
exposition, we assume that the adaptive channel recommendation
MDP has a unique global optimal policy. Numerical studies
in [6] show that the MRAS method also converges when
there are multiple global optimal solutions. We shall show
that the random policy generation mechanism $f(\pi, \mu_k, \sigma_k)$
will eventually generate the optimal policy.

Theorem 5. For the MRAS algorithm, the limiting point of the
policy sequence $\{\pi_k\}$ generated by the sequence of random
policy generation mechanisms $\{f(\pi, \mu_k, \sigma_k)\}$ converges
pointwise to the optimal spectrum access policy $\pi^*$ for the
adaptive channel recommendation MDP, i.e.,

$$
\lim_{k \to \infty} E_{f(\pi, \mu_k, \sigma_k)}[\pi(R)] = \pi^*(R), \quad \forall R \in \mathcal{R}, \tag{19}
$$

$$
\lim_{k \to \infty} \mathrm{Var}_{f(\pi, \mu_k, \sigma_k)}[\pi(R)] = 0, \quad \forall R \in \mathcal{R}. \tag{20}
$$

The proof is given in the Appendix.

From Theorem 5, we see that the parameters $(\mu_{R,k}, \sigma_{R,k})$
updated in (21) and (22) also converge, i.e.,

$$
\lim_{k \to \infty} \mu_{R,k} = \pi^*(R), \quad \forall R \in \mathcal{R},
$$

$$
\lim_{k \to \infty} \sigma_{R,k} = 0, \quad \forall R \in \mathcal{R}.
$$

Thus, we can use $\max_{R \in \mathcal{R}} \sigma_{R,k} < \xi$ as the stopping criterion
in Algorithm 1.
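The overall loop (generate, select elites, update, stop) can be sketched as below. This is a simplified illustration, not the authors' code. It assumes importance weights of the form $e^{(k-1)\Phi_{\pi_i}} / f(\pi_i, \mu_{k-1}, \sigma_{k-1})$ on the elite samples, which is the sample-average counterpart of (14) and (15) as reconstructed here. The throughput function is a hypothetical toy objective standing in for the MDP evaluation, and the function names, the iteration cap `max_iter`, and the small floor on $\sigma$ are additions for numerical safety.

```python
import math
import numpy as np

def mras_search(throughput, mu0, sigma0, rho=0.1, L=500, xi=1e-3,
                max_iter=200, seed=0):
    """Simplified MRAS loop with independent Gaussian sampling per system state."""
    rng = np.random.default_rng(seed)
    mu = np.array(mu0, dtype=float)
    sigma = np.array(sigma0, dtype=float)
    gamma = 0.0                                   # initial elite threshold gamma_0 = 0
    for k in range(1, max_iter + 1):
        policies = rng.normal(mu, sigma, size=(L, mu.size))
        phi = np.array([throughput(p) for p in policies])
        idx = math.ceil(round((1.0 - rho) * L, 9))
        gamma = max(np.sort(phi)[idx - 1], gamma)  # thresholds never decrease
        elite = phi > gamma
        if not elite.any():
            continue                               # no elite sample: resample
        z = (policies[elite] - mu) / sigma
        log_f = np.sum(-0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi), axis=1)
        log_w = (k - 1) * phi[elite] - log_f       # log of e^{(k-1)Phi} / f
        w = np.exp(log_w - log_w.max())            # stabilise before normalising
        w /= w.sum()
        mu_new = w @ policies[elite]               # weighted mean
        sigma = np.sqrt(w @ (policies[elite] - mu_new) ** 2)  # weighted std
        sigma = np.maximum(sigma, 1e-12)           # floor avoids division by zero
        mu = mu_new
        if sigma.max() < xi:                       # stopping criterion
            break
    return mu, sigma

# Hypothetical toy objective: best policy is `target`; infeasible policies get -inf.
target = np.array([0.3, 0.6, 0.8])

def toy_throughput(policy):
    if np.any(policy <= 0.0) or np.any(policy >= 1.0):
        return -np.inf
    return 1.0 - float(np.sum((policy - target) ** 2))

mu_star, sigma_star = mras_search(toy_throughput, [0.5] * 3, [0.5] * 3)
```

In this sketch the weights are handled in the log domain so that the factor $e^{(k-1)\Phi_\pi}$ cannot overflow for large $k$, and the sampling mean settles near the toy target as the standard deviations shrink.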

VI. Adaptive Channel Recommendation With 
Channel Heterogeneity 

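We now generalize the adaptive channel recommendation
to the heterogeneous channel setting. Recall that the system
state $R$ in the homogeneous channel case only keeps track
of how many channels are recommended. In a heterogeneous
channel environment, each channel $m$ has a different data rate
$B_m$ and channel state changing probabilities $p_m$ and $q_m$.
Keeping track of the number of recommended channels is not
enough for optimal decisions. Intuitively, if a channel with a
higher data rate $B_m$ is recommended, users should choose
this channel with a higher weight. The new system state for
the heterogeneous channel case should be defined as a vector
$R = (I_1, \ldots, I_M)$, where $I_m = 1$ if channel $m$ is recommended
and $I_m = 0$ otherwise. The objective of the heterogeneous
channel recommendation MDP is then to find the optimal
channel access probabilities $\{P_m(R)\}_{m=1}^{M}$ for each system
state $R$, where $P_m(R)$ is the probability of selecting channel
$m$.

Algorithm 1 MRAS-based Algorithm For Adaptive Recommendation
Based Optimal Spectrum Access
1: Initialize parameters for Gaussian distributions $(\mu_0, \sigma_0)$,
the elite ratio $\rho$, and the stopping criterion $\xi$. Set initial
elite threshold $\gamma_0 = 0$ and iteration index $k = 0$.
2: repeat
3: Increase iteration index $k$ by 1.
4: Generate $L$ candidate policies $\pi_1, \ldots, \pi_L$ from the
random policy generation mechanism $f(\pi, \mu_{k-1}, \sigma_{k-1})$.
5: Select elite policies by setting the elite threshold
$\gamma_k = \max\{\Phi_{\pi_{\lceil (1-\rho)L \rceil}}, \gamma_{k-1}\}$.
6: Update the random policy generation mechanism by

$$
\mu_{R,k} = \frac{\sum_{i=1}^{L} \frac{e^{(k-1)\Phi_{\pi_i}}}{f(\pi_i, \mu_{k-1}, \sigma_{k-1})} I_{\{\Phi_{\pi_i} > \gamma_k\}}\, \pi_i(R)}{\sum_{i=1}^{L} \frac{e^{(k-1)\Phi_{\pi_i}}}{f(\pi_i, \mu_{k-1}, \sigma_{k-1})} I_{\{\Phi_{\pi_i} > \gamma_k\}}}, \quad \forall R \in \mathcal{R}, \tag{21}
$$

$$
\sigma_{R,k}^2 = \frac{\sum_{i=1}^{L} \frac{e^{(k-1)\Phi_{\pi_i}}}{f(\pi_i, \mu_{k-1}, \sigma_{k-1})} I_{\{\Phi_{\pi_i} > \gamma_k\}}\, [\pi_i(R) - \mu_{R,k}]^2}{\sum_{i=1}^{L} \frac{e^{(k-1)\Phi_{\pi_i}}}{f(\pi_i, \mu_{k-1}, \sigma_{k-1})} I_{\{\Phi_{\pi_i} > \gamma_k\}}}, \quad \forall R \in \mathcal{R}. \tag{22}
$$

7: until $\max_{R \in \mathcal{R}} \sigma_{R,k} < \xi$.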

Similar to the homogeneous channel case, we can apply
the MRAS method to obtain the optimal solution under the
new formulation. However, the number of decision variables
$\{P_m(R)\}_{m=1}^{M}$ in the heterogeneous channel model equals
$M 2^M$, which causes an exponential blow-up in the computational
complexity. We next focus on developing a low-complexity
heuristic algorithm to solve the MDP.

Recall that in the heuristic algorithm in Lemma 1 for the
homogeneous channel recommendation, the weight of selecting
each recommended channel is $\frac{1}{N}$ and the total weight of choosing
recommended channels is $\frac{R}{N}$. Similarly, we can design
a low-complexity heuristic algorithm for the heterogeneous
channel recommendation. More specifically, we set the weight
of selecting channel $m$ to $P_1^m$ ($P_0^m$, respectively) when the
channel is recommended (not recommended,
respectively). Given that the system is in state $R$, the probability
of choosing channel $m$ is proportional to the weight of its state
$I_m$, i.e.,

$$
P_m(R) = \frac{P^m_{I_m}}{\sum_{m'=1}^{M} P^{m'}_{I_{m'}}}. \tag{23}
$$

In this case, the total number of decision variables $P^m_{I_m}$
is reduced to $2M$, which grows linearly in the number of
channels $M$. Let $\pi = \{(P_1^m, P_0^m)\}_{m=1}^{M} \in (0, 1)^{2M}$ denote the
set of corresponding decision variables. Our objective is to find
the optimal $\pi$ that maximizes the time average throughput $\Phi_\pi$.
We can again apply the MRAS method to find the optimal
solution, which is given in Algorithm 2. The derivation
procedure is very similar to that of the MRAS method for the
homogeneous channel recommendation; we omit the details
due to space limit.
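The weight-normalization rule in (23) can be sketched as follows. The function name, the four-channel example, and the numerical weights are hypothetical; only the rule itself (pick each channel's weight according to whether it is recommended, then normalize) comes from the text.

```python
import numpy as np

def access_probabilities(state, w_rec, w_unrec):
    """P_m(R): each channel's weight for its current state, normalised over channels."""
    state = np.asarray(state)
    weights = np.where(state == 1, w_rec, w_unrec)
    return weights / weights.sum()

# Hypothetical example with M = 4 channels; channels 1 and 3 are recommended.
state = np.array([1, 0, 1, 0])
w_rec = np.array([0.8, 0.6, 0.4, 0.9])     # weights used when recommended
w_unrec = np.array([0.1, 0.2, 0.1, 0.3])   # weights used when not recommended
probs = access_probabilities(state, w_rec, w_unrec)

M = len(state)
heuristic_variables = 2 * M        # grows linearly in M
full_mdp_variables = M * 2**M      # grows exponentially in M
```

For $M = 4$ the heuristic uses 8 decision variables against 64 for the full MDP; for the $M = 10$ setting used in the simulations the gap is 20 against 10240.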

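Note that the optimal policy $\pi^*$ for the heuristic heterogeneous
channel recommendation is also a feasible policy
for the heterogeneous channel recommendation MDP. The
performance of the optimal policy for the heterogeneous
channel recommendation MDP thus dominates that of the heuristic
heterogeneous channel recommendation. However, numerical
results show that the heuristic heterogeneous channel
recommendation has a small performance loss compared to
the optimal policy while gaining a significant reduction in
computational complexity.

Algorithm 2 MRAS-based Algorithm For Optimizing Heuristic
Heterogeneous Channel Recommendation
1: Initialize parameters for the elite ratio $\rho$, Gaussian
distributions $\mu(0) = \{(\mu_1^m(0), \mu_0^m(0))\}_{m=1}^{M}$, $\sigma(0) =
\{(\sigma_1^m(0), \sigma_0^m(0))\}_{m=1}^{M}$, and the stopping criterion $\xi$. Set
initial elite threshold $\gamma_0 = 0$ and iteration index $k = 0$.
2: repeat
3: Increase iteration index $k$ by 1.
4: Generate $L$ candidate policies $\pi_1, \ldots, \pi_L$ from the
random policy generation mechanism $f(\pi, \mu(k-1), \sigma(k-1))$.
5: Select elite policies by setting the elite threshold
$\gamma_k = \max\{\Phi_{\pi_{\lceil (1-\rho)L \rceil}}, \gamma_{k-1}\}$.
6: Update the random policy generation mechanism by
(for any $I_m \in \{0, 1\}$, $m \in \mathcal{M}$, where $P^m_{I_m}(\pi_i)$ denotes the
weight $P^m_{I_m}$ under candidate policy $\pi_i$)

$$
\mu^m_{I_m}(k) = \frac{\sum_{i=1}^{L} \frac{e^{(k-1)\Phi_{\pi_i}}}{f(\pi_i, \mu(k-1), \sigma(k-1))} I_{\{\Phi_{\pi_i} > \gamma_k\}}\, P^m_{I_m}(\pi_i)}{\sum_{i=1}^{L} \frac{e^{(k-1)\Phi_{\pi_i}}}{f(\pi_i, \mu(k-1), \sigma(k-1))} I_{\{\Phi_{\pi_i} > \gamma_k\}}}, \tag{24}
$$

$$
[\sigma^m_{I_m}(k)]^2 = \frac{\sum_{i=1}^{L} \frac{e^{(k-1)\Phi_{\pi_i}}}{f(\pi_i, \mu(k-1), \sigma(k-1))} I_{\{\Phi_{\pi_i} > \gamma_k\}}\, [P^m_{I_m}(\pi_i) - \mu^m_{I_m}(k)]^2}{\sum_{i=1}^{L} \frac{e^{(k-1)\Phi_{\pi_i}}}{f(\pi_i, \mu(k-1), \sigma(k-1))} I_{\{\Phi_{\pi_i} > \gamma_k\}}}. \tag{25}
$$

7: until $\max_{I_m \in \{0,1\},\, m \in \mathcal{M}} \sigma^m_{I_m}(k) < \xi$.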

VII. Simulation Results 

In this section, we investigate the proposed adaptive channel
recommendation scheme by simulations. The results show
that the adaptive channel recommendation scheme not only
achieves higher performance than the static channel
recommendation scheme and the random access scheme, but is also more
robust to dynamic changes in the channel environment.

A. Simulation Setup 

We first consider a cognitive radio network consisting of
multiple independent and stochastically homogeneous primary
channels. The data rate of each channel is normalized to 1
Mbps. In order to take the impact of the primary users' long-run
behavior into account, we consider the following two types of
channel state transition matrices:

$$
\text{Type 1: } \Gamma^1 = \begin{bmatrix} 1 - 0.005\epsilon & 0.005\epsilon \\ 0.025\epsilon & 1 - 0.025\epsilon \end{bmatrix}, \tag{26}
$$

$$
\text{Type 2: } \Gamma^2 = \begin{bmatrix} 1 - 0.01\epsilon & 0.01\epsilon \\ 0.01\epsilon & 1 - 0.01\epsilon \end{bmatrix}, \tag{27}
$$

where $\epsilon$ is the dynamic factor. Recall that a larger $\epsilon$ means that
the channels are more dynamic over time. Using (2), we know
that channel models $\Gamma^1$ and $\Gamma^2$ have stationary channel idle
probabilities of 1/6 and 1/2, respectively. In other words, the
primary activity level is much higher with the Type 1 channel
than with the Type 2 channel.
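The stated idle probabilities can be verified numerically. The sketch below assumes that the first state of each matrix is "busy" and the second is "idle" (an ordering inferred here from the 1/6 and 1/2 values, not stated in this section); the function names and the choice of the dynamic factor equal to 10 are illustrative.

```python
import numpy as np

def transition_matrix(p, q, eps):
    """Two-state chain; state 0 = busy, state 1 = idle (assumed ordering)."""
    return np.array([[1 - p * eps, p * eps],
                     [q * eps, 1 - q * eps]])

def idle_probability(gamma):
    """Stationary probability of the idle state of a two-state chain."""
    busy_to_idle = gamma[0, 1]
    idle_to_busy = gamma[1, 0]
    return busy_to_idle / (busy_to_idle + idle_to_busy)

type1 = transition_matrix(0.005, 0.025, eps=10)
type2 = transition_matrix(0.01, 0.01, eps=10)
idle1 = idle_probability(type1)
idle2 = idle_probability(type2)
```

The dynamic factor cancels in the ratio, so the idle probabilities stay at 1/6 and 1/2 for every admissible value of the factor; only the speed of the state changes varies.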

We initialize the parameters of the MRAS algorithm as follows.
We set $\mu_R = 0.5$ and $\sigma_R = 0.5$ for the Gaussian distribution,
which has 68.2% support over the feasible region (0, 1).
We found that the performance of the MRAS algorithm is
insensitive to the elite ratio $\rho$ when $\rho < 0.3$. We thus choose
$\rho = 0.1$.

When using the MRAS-based algorithm, we need to
determine how many (feasible) candidate policies to generate
in each iteration. Figure 5 shows the convergence of the MRAS
algorithm with 100, 300, and 500 candidate policies per
iteration, respectively. We have two observations. First, the number
of iterations needed to achieve convergence decreases as the number of
candidate policies increases. Second, the gain in convergence speed is
insignificant when the number increases from 300 to 500. We
thus choose $L = 500$ for the experiments in the sequel.




Fig. 5. The convergence of the MRAS-based algorithm with different numbers
of candidate policies per iteration ($L = 100$, $300$, and $500$).



B. Simulation Results 

We implement the adaptive channel recommendation
scheme with $M = 10$ channels and $N = 5$ secondary users.
We benchmark the adaptive channel recommendation
scheme against the static channel recommendation scheme in [5]
and the random access scheme. We choose
the dynamic factor $\epsilon$ within a wide range to investigate the
robustness of the schemes to the channel dynamics. The results
are shown in Figures 6-9. From these figures, we see that

• Superior performance of the adaptive channel recommendation
scheme (Figures 6 and 7): the adaptive channel
recommendation scheme performs better than the random
access scheme and the static channel recommendation
scheme. Typically, it offers a 5%~18% performance gain
over the static channel recommendation scheme.

• Impact of channel dynamics (Figures 6 and 7): the
performance of both the adaptive and static channel
recommendation schemes degrades as the dynamic factor $\epsilon$
increases. The reason is that both schemes rely on the
recommendation information from previous time slots to
make decisions. When channel states change rapidly, the
value of recommendation information diminishes. However,
the adaptive channel recommendation is much more
robust to changes in the dynamic channel environment (see
Figure 9). This is because the optimal adaptive policy
takes the channel dynamics into account while the static
one does not.

• Impact of channel idleness level (Figures 8 and 9):
Figure 8 shows the performance gain of the adaptive
channel recommendation scheme over the random access
scheme under the two types of transition matrices.
We see that the performance gain decreases
with the idle probability of the channel. This shows that
the information of channel recommendations can enhance
the spectrum access more efficiently when the primary
activity level increases (i.e., when the channel idle
probability is low). Interestingly, Figure 9 shows that the
performance gain of the adaptive channel recommendation
scheme over the static channel recommendation scheme
tends to increase with the channel idleness probability.
This illustrates that the adaptive channel recommendation
scheme can better utilize the channel opportunities given
the information of channel recommendations.



Fig. 6. System throughput with $M = 10$ channels and $N = 5$ users under
the Type 1 channel state transition matrix (horizontal axis: dynamic factor $\epsilon$).

Fig. 7. System throughput with $M = 10$ channels and $N = 5$ users under
the Type 2 channel state transition matrix.

Fig. 8. Performance gain over random access scheme. The Type 1 and Type
2 channels have the stationary channel idle probabilities of 1/6 and 1/2,
respectively.

Fig. 9. Performance gain over static channel recommendation scheme. The
Type 1 and Type 2 channels have the stationary channel idle probabilities of
1/6 and 1/2, respectively.



C. Comparison of MRAS algorithm and Q-Learning 

To benchmark the performance of the spectrum access
policy based on the MRAS algorithm, we compare it with
the policy obtained by the Q-learning algorithm [9].

Since Q-learning can only be used over a discrete
action space, we first discretize the action space $\mathcal{P}$ into a finite
discrete action space $\hat{\mathcal{P}} = \{0.1, \ldots, 1.0\}$. Q-learning then
defines a Q-value representing the estimated quality of a
state-action combination as $Q : \mathcal{R} \times \hat{\mathcal{P}} \to \mathbb{R}$. When a new reward
$U(R(t), P_{rec}(t))$ is received, we update the Q-value as

$$
Q(R(t), P_{rec}(t)) = (1 - \alpha)\, Q(R(t), P_{rec}(t))
+ \alpha \left[ U(R(t), P_{rec}(t)) + \max_{P_{rec} \in \hat{\mathcal{P}}} Q(R(t+1), P_{rec}) \right],
$$

where $0 < \alpha < 1$ is the smoothing factor. Given a system state
$R$, the probability of choosing an action $P_{rec}$ is

$$
\Pr(P_{rec}(t) = P_{rec} \mid R(t) = R) = \frac{e^{\tau Q(R, P_{rec})}}{\sum_{P'_{rec} \in \hat{\mathcal{P}}} e^{\tau Q(R, P'_{rec})}},
$$

where $\tau > 0$ is the temperature.

After Q-learning converges, we obtain the corresponding
spectrum access policy $\pi_Q$ over the discretized action space $\hat{\mathcal{P}}$.
Note that $\pi_Q$ is a sub-optimal policy for the adaptive channel
recommendation MDP over the continuous action space $\mathcal{P}$.
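The Q-learning benchmark can be sketched as a tabular update plus Boltzmann action selection. Several details below are assumptions rather than facts from the paper: the action grid uses a step of 0.1, the selection rule is taken to be proportional to $e^{\tau Q}$, the table has six system states, and the update carries no discount factor because none is visible in the printed rule. The function names are hypothetical.

```python
import numpy as np

# Assumed discretisation of the action space: 0.1, 0.2, ..., 1.0
ACTIONS = np.round(np.arange(0.1, 1.01, 0.1), 1)

def q_update(Q, state, action_idx, reward, next_state, alpha):
    """Smoothed update: Q <- (1 - alpha) * Q + alpha * (reward + max_a Q(next_state, a))."""
    target = reward + Q[next_state].max()
    Q[state, action_idx] = (1 - alpha) * Q[state, action_idx] + alpha * target

def boltzmann_probs(Q, state, tau):
    """Action-selection probabilities proportional to exp(tau * Q(state, a))."""
    scaled = tau * Q[state]
    scaled -= scaled.max()          # stabilise the exponentials
    e = np.exp(scaled)
    return e / e.sum()

Q = np.zeros((6, len(ACTIONS)))     # hypothetical: states R = 0, ..., 5
q_update(Q, state=2, action_idx=4, reward=1.0, next_state=3, alpha=0.5)
probs = boltzmann_probs(Q, state=2, tau=1.0)
```

Because the learned policy can only pick values on the grid, it cannot represent an optimal recommendation probability that falls between two grid points, which is the source of the sub-optimality noted in the text.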
We compare the Q-learning based policy with our MRAS-based
optimal policy when there are $M = 10$ channels and
$N = 5$ users, and show the simulation results in Figures 10 and
11. From these figures, we see that the MRAS-based algorithm
outperforms Q-learning by up to 10%, which demonstrates the
effectiveness of our proposed algorithm.



Fig. 10. Comparison of the MRAS-based algorithm and Q-learning with the Type
1 channel state transition matrix (curves: Q-learning, MRAS algorithm, and
static channel recommendation).

Fig. 11. Comparison of the MRAS-based algorithm and Q-learning with the Type
2 channel state transition matrix (curves: Q-learning, MRAS algorithm, and
static channel recommendation).

D. Heuristic Heterogeneous Channel Recommendation

We now evaluate the proposed heuristic heterogeneous
channel recommendation mechanism in Section VI with a
network consisting of $M = 10$ channels and $N = 5$ users.
We implement the heuristic heterogeneous channel
recommendation mechanism in both homogeneous and heterogeneous
channel environments.

1) Homogeneous Channel Environment: We first study how
the heuristic heterogeneous channel recommendation mechanism
performs in the homogeneous channel environment
(which is a special case of the heterogeneous environment) under
both the Type 1 ($\Gamma^1$) and Type 2 ($\Gamma^2$) homogeneous channel environments,
and simulate the optimal homogeneous channel recommendation
(Algorithm 1) as a benchmark. The data rate of each
channel is normalized to 1 Mbps. The results are shown in
Figures 12 and 13. Compared to the optimal channel access
policy, the performance loss of the heuristic heterogeneous
channel recommendation in the Type 1 and Type 2 channel
environments is at most 12% and 5%, respectively. This
shows the efficiency of the heuristic heterogeneous channel
recommendation in homogeneous channel environments.

Fig. 12. Comparison of heuristic heterogeneous channel recommendation
and optimal homogeneous channel recommendation in the Type 1 homogeneous
channel environment.

Fig. 13. Comparison of heuristic heterogeneous channel recommendation
and optimal homogeneous channel recommendation in the Type 2 homogeneous
channel environment (curves: random access, optimal homogeneous channel
recommendation, and heuristic heterogeneous channel recommendation;
horizontal axis: dynamic factor $\epsilon$).

2) Heterogeneous Channel Environment: We next implement
the heuristic heterogeneous channel recommendation
mechanism in heterogeneous channel environments. The data
rates of the $M = 10$ channels are $\{B_1 = 0.2, B_2 = 0.6, B_3 =
0.8, B_4 = 1, B_5 = 2, B_6 = 4, B_7 = 6, B_8 = 8, B_9 =
10, B_{10} = 20\}$ Mbps. We consider two kinds of stochastic
channel state changing environments:

$$
\{\Gamma_1 = \Gamma^2, \Gamma_2 = \Gamma^2, \Gamma_3 = \Gamma^2, \Gamma_4 = \Gamma^2, \Gamma_5 = \Gamma^2,
\Gamma_6 = \Gamma^1, \Gamma_7 = \Gamma^1, \Gamma_8 = \Gamma^1, \Gamma_9 = \Gamma^1, \Gamma_{10} = \Gamma^1\}, \tag{28}
$$

and

$$
\{\Gamma_1 = \Gamma^1, \Gamma_2 = \Gamma^1, \Gamma_3 = \Gamma^1, \Gamma_4 = \Gamma^1, \Gamma_5 = \Gamma^1,
\Gamma_6 = \Gamma^2, \Gamma_7 = \Gamma^2, \Gamma_8 = \Gamma^2, \Gamma_9 = \Gamma^2, \Gamma_{10} = \Gamma^2\}. \tag{29}
$$

Here the subscript denotes the channel index, and the superscript denotes
the channel type index. For the first kind of channel environment,
a channel with a low data rate tends to have a low primary
transmission occupancy, while for the second kind, a channel
with a high data rate tends to have a high idleness probability.
We also implement the static channel recommendation, the
optimal homogeneous channel recommendation (Algorithm 1) and the
optimal heterogeneous channel recommendation (obtained by
adapting the MRAS algorithm to optimize the heterogeneous
channel MDP, not shown in this paper) as benchmarks. The
results are depicted in Figures 14 and 15. From these figures,
we see that:

• For the first kind of channel environment, the heuristic
heterogeneous channel recommendation achieves up to
40% and 100% performance improvement over the
optimal homogeneous channel recommendation and the static
channel recommendation, respectively. Compared with
the optimal heterogeneous channel recommendation, the
performance loss of the heuristic heterogeneous channel
recommendation is at most 35%. Note that the number of
decision variables in the optimal heterogeneous channel
recommendation is $M 2^M = 10240$, while the number of
decision variables in the heuristic heterogeneous channel
recommendation is only $2M = 20$. The convergence
of the heuristic heterogeneous channel recommendation
is hence much faster than that of the optimal heterogeneous
channel recommendation.

• For the second kind of channel environment, the heuristic
heterogeneous channel recommendation achieves up to
70% and 100% performance improvement over the
optimal homogeneous channel recommendation and the static
channel recommendation, respectively. The performance
loss is at most 20% compared with the optimal
heterogeneous channel recommendation. Comparing with
Figure 14, we see that the heuristic heterogeneous channel
recommendation performs better if more channel
opportunities are available for the secondary users.

Fig. 14. Comparison of heuristic heterogeneous channel recommendation,
optimal homogeneous channel recommendation, and optimal heterogeneous
channel recommendation in the first kind of heterogeneous channel environment.

Fig. 15. Comparison of heuristic heterogeneous channel recommendation,
optimal homogeneous channel recommendation, and optimal heterogeneous
channel recommendation in the second kind of heterogeneous channel environment.

VIII. Related Work 

The spectrum access by multiple secondary users can be
either uncoordinated or coordinated. For the uncoordinated case,
multiple secondary users compete with each other for the resource.
Huang et al. in [10] designed two auction mechanisms to
allocate the interference budget among selfish users. Southwell
and Huang in [11] studied the largest and smallest convergence
time to an equilibrium when secondary users access multiple
channels in a distributed fashion. Liu et al. in [12] modeled
the interactions among spatially separated users as congestion
games with resource reuse. Li and Han in [13] applied
graphical game theory to address the spectrum access problem
with a limited range of mutual interference. Anandkumar et al.
in [14] proposed a learning-based approach for competitive
spectrum access with incomplete spectrum information. Law
et al. in [15] showed that uncoordinated spectrum access may
lead to poor system performance.

For the coordinated spectrum access, Zhao et al. in [16]
proposed a dynamic group formation algorithm to distribute
secondary users' transmissions across multiple channels. Shu
and Krunz proposed a multi-level spectrum opportunity
framework in [17]. The above papers assumed that each secondary
user knows the entire channel occupancy information. We
consider the case where each secondary user only has a limited
view of the system, and users improve each other's information by
recommendation.

Our algorithm design is partially inspired by the recommendation
systems in the electronic commerce industry, where
analytical methods such as collaborative filtering [18] and
multi-armed bandit process modeling [19] are useful. However, we
cannot directly apply the existing methods to analyze cognitive
radio networks due to the unique congestion effect in our
model.

IX. Conclusion 

In this paper, we propose an adaptive channel recommenda- 
tion scheme for efficient spectrum sharing. We formulate the 
problem as an average reward based Markov decision process. 
We first prove the existence of the optimal stationary spectrum 
access policy, and then characterize the structure of the optimal 
policy in two asymptotic cases. Furthermore, we propose a 
novel MRAS-based algorithm that is provably convergent to 
the optimal policy. Numerical results show that our proposed 
algorithm outperforms the static approach in the literature by
up to 18% and the Q-learning method by up to 10% in terms
of system throughput. Our algorithm is also more robust to
the channel dynamics compared to the static counterpart.

In terms of future work, we are currently extending the
analysis by taking the heterogeneity of channels into
consideration. We also plan to consider the case where the
secondary users are selfish. Designing an incentive-compatible
channel recommendation mechanism for that case will be very
interesting and challenging.

Appendix 

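A. Proof of Lemma ??

When $S_m(t) = 0$, this trivially holds. We focus on the case
that $S_m(t) = 1$.

Let $\mathcal{K}_m = \{1, \ldots, k_m(t)\}$ be the set of secondary users
accessing the channel $m$, $\tau_i$ be the backoff time generated
by secondary user $i$, and $\tau_{-n} = \min\{\tau_i \mid i \ne n, i \in \mathcal{K}_m\}$. The
probability that the user $n$ captures the channel $m$ is given as

$$
Pr_n = \Pr\{\tau_n < \tau_{-n}\} = \int_0^{\tau_{\max}} \frac{1}{\tau_{\max}} \left(1 - \frac{\tau_n}{\tau_{\max}}\right)^{k_m(t)-1} d\tau_n = \frac{1}{k_m(t)}.
$$

Thus, the expected throughput of user $n$ is

$$
B\, Pr_n = \int_0^{\tau_{\max}} \frac{B}{\tau_{\max}} \left(1 - \frac{\tau_n}{\tau_{\max}}\right)^{k_m(t)-1} d\tau_n = \frac{B}{k_m(t)} = \frac{B S_m(t)}{k_m(t)}.
$$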



B. Proof of Lemma 1

Let $A_C$ denote the event that $C$ secondary users choose the
recommended channels, and $\Pr(c_1, \ldots, c_R)$ denote the probability
mass function that the numbers of secondary users on these $R$
recommended channels equal $c_1, \ldots, c_R$, respectively. Given
the event $A_C$, we have

$$
\Pr(c_1, \ldots, c_R \mid A_C) = \binom{C}{c_1, \ldots, c_R} \left(\frac{1}{R}\right)^{C},
$$

which is a multinomial mass function. By the property of the
multinomial distribution, we have

$$
E[c_m \mid A_C] = \frac{C}{R}.
$$

It follows that the expected number of users choosing a
recommended channel $m$ is

$$
E[c_m] = \sum_{C=0}^{N} E[c_m \mid A_C] \Pr(A_C)
= \sum_{C=0}^{N} \frac{C}{R} \binom{N}{C} P_{rec}^{C} (1 - P_{rec})^{N-C}
= \frac{P_{rec} N}{R}.
$$

Then $E[c_m] = 1$ requires that

$$
P_{rec} = \frac{R}{N}.
$$
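The identity $E[c_m] = P_{rec} N / R$ can be checked by summing the binomial expectation directly. The function name and the values $N = 5$, $R = 2$ are chosen here for illustration only.

```python
from math import comb

def expected_users_per_recommended_channel(N, R, p_rec):
    """Sum over C of E[c_m | A_C] * Pr(A_C), with C ~ Binomial(N, p_rec)."""
    return sum((C / R) * comb(N, C) * p_rec**C * (1 - p_rec)**(N - C)
               for C in range(N + 1))

N, R = 5, 2
value_at_heuristic = expected_users_per_recommended_channel(N, R, p_rec=R / N)
value_generic = expected_users_per_recommended_channel(N, R, p_rec=0.3)
```

Setting the recommendation probability to $R/N$ gives an expected load of one user per recommended channel, while a generic value such as 0.3 gives $0.3 \cdot 5 / 2 = 0.75$.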



C. Derivation of Transition Probability 

When the system state transits from $R$ to $R'$, we assume that
$m_r$ and $m_u$ recommendations, out of the $R'$ recommendations,
are for channels that have been recommended and have not
been recommended at time slot $t$, respectively. Obviously,
$m_r + m_u = R'$. We assume that $\bar{m}_r$ recommended channels
and $\bar{m}_u$ unrecommended channels have been accessed by the
secondary users at time slot $t+1$. We thus have $R \ge \bar{m}_r \ge m_r$
and $M - R \ge \bar{m}_u \ge m_u$. We also assume that
$n_r$ secondary users have accessed these $\bar{m}_r$ recommended
channels and $n_u$ secondary users have accessed those $\bar{m}_u$
unrecommended channels at time slot $t + 1$. Obviously, we
have $n_r + n_u = N$, $n_r \ge \bar{m}_r$ and $n_u \ge \bar{m}_u$.

For the first term, the probability that the user distribution
$(n_r, n_u)$ happens follows the Binomial distribution as

$$
\binom{N}{n_r} P_{rec}^{n_r} (1 - P_{rec})^{n_u}.
$$



For the second term, when rrir > 1, it is easy to check 
that there are ^ ^ ^ ways for rir secondary users to 

choose rhr recommended channels and there are 
possibilities for these fhr recommended channels out of the R 
recommended channels, each of which has probability (-^)"''. 
Among these fhr recommended channels that have been ac- 
cessed by the secondary users, the probability that rrir channels 



turn out to be idle is given as 



(1 



qynrqfnr 



|--| When iTLr = 0, it requires that Ur = 0. Thus, we define 



rir — 1 

-1 



1 If nr=0, 
Otherwise. 



□ 



Similarly, we can obtain the third term for the unrecom- 
mended channels case. 



D. Lemma 5 

Since the operation $\sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'}[\cdot]$ plays a key role in the
Bellman equation, to facilitate the study, we first define the
following function

$$
f_r(R, P_{rec}) = \sum_{i=r}^{\min\{M,N\}} P^{P_{rec}}_{R,i}, \quad \forall r \in \mathcal{R}.
$$

Since

$$
f_r(R, P_{rec}) = \Pr\{R(t+1) \ge r \mid R(t) = R, P_{rec}(t) = P_{rec}\}
= 1 - \Pr\{R(t+1) < r \mid R(t) = R, P_{rec}(t) = P_{rec}\},
$$

we call the function $f_r(R, P_{rec})$ the reverse cumulative
distribution function in the sequel.

Lemma 4. When $M = \infty$ and $N < \infty$, the reverse
cumulative distribution function $f_r(R, P_{rec})$ is nondecreasing
in $R$ for all $r, R \in \mathcal{R}$, $P_{rec} \in \mathcal{P}$.

Proof: We prove the result by induction. With a slight abuse
of notation, we denote the transition probability $P^{P_{rec}}_{R,R'}$ and
the reverse cumulative distribution function $f_r(R, P_{rec})$ when
the number of users $N = k$ as $P^{P_{rec}}_{R,R'}(k)$ and $f_r^k(R, P_{rec})$,
respectively.
When N ^ 2, from (|7]i, we have 



p + q 

q 



rec 1 

P + q 



{p + qY 

<r(2) = (i-p.ecf(^r, 



~\~^Prec{^ Prec) 



P + q 



P[rm = PXAl - q) + - Prec? ^^'^ 



-|-2/^7-gc (l Prec ) 



{p + qY 

(1 - q)q+pq 



P + q 



J5 + g 



^^2X2) = pl'-^ + n - Pre.n^r 

z p + q 

-h2_P7-ec(l ~ Prec^ ! ; 

p + q 



plrm - pl '-'^['-'^' + {i-Pr - 



(1 -q)q+pq 



P + q 



p-r(2) - 



2-frec( 1 Prec) 



(1 - g)p 



It is easy to check the following holds 

pj:r(2) > p/:r(2)>prr(2), 

p[rm > p/:r(2)>Po>(2). 

Since 

E (2) = E (2) = E (2) = 1. 

i=0 i=0 i=0 

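we thus obtain

$$
f_r^2(R + 1, P_{rec}) \ge f_r^2(R, P_{rec}), \quad \forall R, r \in \mathcal{R}, P_{rec} \in \mathcal{P},
$$

i.e., $f_r(R, P_{rec})$ is nondecreasing in $R$ for the case $N = 2$.

We then assume that $f_r(R, P_{rec})$ is nondecreasing in $R$ for
all $R \in \mathcal{R}$, $P_{rec} \in \mathcal{P}$ for the case that $N = k \ge 2$, i.e.,

$$
f_r^k(R + 1, P_{rec}) \ge f_r^k(R, P_{rec}), \quad \forall R, r \in \mathcal{R}, P_{rec} \in \mathcal{P}.
$$

We next prove that $f_r^{k+1}(R, P_{rec})$ is nondecreasing in $R$ for the case
$N = k + 1$ under this hypothesis.

Let $\psi$ denote the event that one arbitrary user, out of these
$k + 1$ users, does not generate a recommendation at time slot
$t + 1$. Obviously,

$$
\Pr(\psi) = P_{rec}\, q + (1 - P_{rec}) \frac{q}{p + q},
$$

which depends on $P_{rec}$ and the channel environment only. By
conditioning on the event $\psi$, we have

$$
P^{P_{rec}}_{R+1,i}(k + 1) = P^{P_{rec}}_{R+1,i-1}(k)[1 - \Pr(\psi)] + P^{P_{rec}}_{R+1,i}(k) \Pr(\psi), \tag{30}
$$

$$
P^{P_{rec}}_{R,i}(k + 1) = P^{P_{rec}}_{R,i-1}(k)[1 - \Pr(\psi)] + P^{P_{rec}}_{R,i}(k) \Pr(\psi). \tag{31}
$$

Thus,

$$
\begin{aligned}
& f_r^{k+1}(R + 1, P_{rec}) - f_r^{k+1}(R, P_{rec}) \\
&= \sum_{i=r}^{k+1} P^{P_{rec}}_{R+1,i}(k + 1) - \sum_{i=r}^{k+1} P^{P_{rec}}_{R,i}(k + 1) \\
&= \left[\sum_{i=r}^{k+1} P^{P_{rec}}_{R+1,i-1}(k) - \sum_{i=r}^{k+1} P^{P_{rec}}_{R,i-1}(k)\right][1 - \Pr(\psi)]
 + \left[\sum_{i=r}^{k} P^{P_{rec}}_{R+1,i}(k) - \sum_{i=r}^{k} P^{P_{rec}}_{R,i}(k)\right] \Pr(\psi) \\
&= \left[\sum_{j=r-1}^{k} P^{P_{rec}}_{R+1,j}(k) - \sum_{j=r-1}^{k} P^{P_{rec}}_{R,j}(k)\right][1 - \Pr(\psi)]
 + \left[\sum_{i=r}^{k} P^{P_{rec}}_{R+1,i}(k) - \sum_{i=r}^{k} P^{P_{rec}}_{R,i}(k)\right] \Pr(\psi) \\
&= [f_{r-1}^{k}(R + 1, P_{rec}) - f_{r-1}^{k}(R, P_{rec})][1 - \Pr(\psi)]
 + [f_r^{k}(R + 1, P_{rec}) - f_r^{k}(R, P_{rec})] \Pr(\psi) \\
&\ge 0,
\end{aligned} \tag{32}
$$

i.e., $f_r(R, P_{rec})$ is also nondecreasing in $R$ for the case $N =
k + 1$. By induction, the result holds for the case
that $N \ge 2$. □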

E. Lemma 6 

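Lemma 5. When $M = +\infty$ and $N < +\infty$, the reverse
cumulative distribution function $f_r(R, P_{rec})$ is supermodular
on $\mathcal{R} \times \mathcal{P}$.

Proof: Showing that $f_r(R, P_{rec})$ is supermodular on $\mathcal{R} \times \mathcal{P}$ is
equivalent to proving that

$$
\frac{\partial^2 f_r(R, P_{rec})}{\partial P_{rec}\, \partial R} \ge 0. \tag{33}
$$

Since $R$ is an integer variable, (33) is equivalent to

$$
\frac{\partial f_r(R + 1, P_{rec})}{\partial P_{rec}} - \frac{\partial f_r(R, P_{rec})}{\partial P_{rec}} \ge 0.
$$

That is, it is equivalent to showing that $\frac{\partial f_r(R, P_{rec})}{\partial P_{rec}}$ is nondecreasing
in $R$. By a procedure similar to the proof of Lemma 4, we can
show that this holds. □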
F. Proof of Proposition 1

We prove the proposition by induction. Suppose that the 
time horizon consists of any T time slots. 

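When $t = T$, $V_T(R) = U_R = RB$, and the proposition is trivially true.

Now, we assume it also holds for $V_t(R)$ when $t = k+1, k+2, \ldots, T$. Let $\bar{R}$ be a system state such that $\bar{R} \ge R$. By the hypothesis, we have $V_{k+1}(\bar{R}) \ge V_{k+1}(R)$. Let $\pi^*$ be the optimal policy. From the Bellman equation, we have

$$V_k(R) = \sum_{R'=0}^{\min\{M,N\}} P^{\pi^*(R)}_{R,R'}\, [U_{R'} + \beta V_{k+1}(R')], \quad \forall R \in \mathcal{R}. \tag{34}$$

By defining a new system state $-1$ such that $U_{-1} + \beta V_{k+1}(-1) = 0$, we can rewrite the equation in (34) as

$$\begin{aligned}
V_k(R) &= \sum_{R'=0}^{\min\{M,N\}} P^{\pi^*(R)}_{R,R'} \sum_{i=0}^{R'} \big\{[U_i + \beta V_{k+1}(i)] - [U_{i-1} + \beta V_{k+1}(i-1)]\big\} \\
&= \sum_{R'=0}^{\min\{M,N\}} \big\{[U_{R'} + \beta V_{k+1}(R')] - [U_{R'-1} + \beta V_{k+1}(R'-1)]\big\} \sum_{i=R'}^{\min\{M,N\}} P^{\pi^*(R)}_{R,i}.
\end{aligned}$$

By Lemma 5 in the Appendix, we have

$$\sum_{i=R'}^{\min\{M,N\}} P^{\pi^*(R)}_{R,i} \le \sum_{i=R'}^{\min\{M,N\}} P^{\pi^*(R)}_{\bar{R},i}, \quad \forall R' \in \mathcal{R}.$$

For $R' \ge 1$, each increment $[U_{R'} + \beta V_{k+1}(R')] - [U_{R'-1} + \beta V_{k+1}(R'-1)]$ is nonnegative, since $U_{R'} = R'B$ is nondecreasing in $R'$ and $V_{k+1}$ is nondecreasing by the hypothesis; for $R' = 0$ both tail sums equal 1. Then

$$\begin{aligned}
V_k(R) &\le \sum_{R'=0}^{\min\{M,N\}} \big\{[U_{R'} + \beta V_{k+1}(R')] - [U_{R'-1} + \beta V_{k+1}(R'-1)]\big\} \sum_{i=R'}^{\min\{M,N\}} P^{\pi^*(R)}_{\bar{R},i} \\
&= \sum_{R'=0}^{\min\{M,N\}} P^{\pi^*(R)}_{\bar{R},R'}\, [U_{R'} + \beta V_{k+1}(R')] \\
&\le \sum_{R'=0}^{\min\{M,N\}} P^{\pi^*(\bar{R})}_{\bar{R},R'}\, [U_{R'} + \beta V_{k+1}(R')] = V_k(\bar{R}),
\end{aligned}$$

where the last inequality holds because $\pi^*(\bar{R})$ is the optimal action in state $\bar{R}$; i.e., for $t = k$, $V_k(\bar{R}) \ge V_k(R)$ also holds. This completes the proof. □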
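The rewriting of (34) is a summation by parts: an expectation of a function is expressed as a sum of its increments weighted by tail probabilities. A minimal Python sketch of this identity is given below; the names `direct_expectation` and `abel_expectation`, the example distributions, and the stand-in function `g` (which plays the role of $R' \mapsto U_{R'} + \beta V_{k+1}(R')$) are assumptions made only for illustration.

```python
def direct_expectation(pmf, g):
    """Compute sum over r of pmf[r] * g(r), as on the right-hand side of (34)."""
    return sum(p * g(r) for r, p in enumerate(pmf))


def abel_expectation(pmf, g):
    """Same quantity, written as increments of g times tail probabilities.

    The value of g at the fictitious state -1 is taken to be 0, mirroring
    the convention U_{-1} + beta * V_{k+1}(-1) = 0 used in the proof.
    """
    tails = [sum(pmf[r:]) for r in range(len(pmf))]
    total = 0.0
    previous = 0.0  # g(-1) := 0
    for r in range(len(pmf)):
        total += (g(r) - previous) * tails[r]
        previous = g(r)
    return total


def g(r):
    # Hypothetical nondecreasing stand-in for U_r + beta * V_{k+1}(r).
    return 2.0 * r + 1.0


pmf_lo = [0.2, 0.5, 0.3]  # hypothetical transition row for state R
pmf_hi = [0.1, 0.3, 0.6]  # hypothetical row for a larger state, larger tails
print(direct_expectation(pmf_lo, g), abel_expectation(pmf_lo, g))
print(direct_expectation(pmf_hi, g))
```

Since the increments of a nondecreasing `g` are nonnegative, replacing the tail probabilities by larger ones can only increase the sum, which is the inequality used in the proof.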

G. Proof of Theorem 5

We first show that under the reference distribution, the 
optimal policy is attainable. 

Lemma 6. For the MRAS algorithm, the policy $\pi$ generated by the sequence of reference distributions $\{g_k\}$ converges pointwise to the optimal spectrum access policy $\pi^*$ for the adaptive channel recommendation MDP, i.e.,

$$\lim_{k \to \infty} E_{g_k}[\pi(R)] = \pi^*(R), \quad \forall R \in \mathcal{R}, \tag{35}$$

$$\lim_{k \to \infty} Var_{g_k}[\pi(R)] = 0, \quad \forall R \in \mathcal{R}. \tag{36}$$

proof: The proof is developed on the basis of the results in 

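First, from the MRAS algorithm, we have

$$\gamma_k \le \gamma_{k+1},$$

i.e., the sequence $\{\gamma_k\}$ is monotone. Since $\gamma_k \le \Phi_{\pi^*}$ is bounded, there must exist a finite $K$ such that $\gamma_{k+1} = \gamma_k$, $\forall k \ge K$.

When $\gamma_K = \Phi_{\pi^*}$,

$$\lim_{k \to \infty} E_{g_k}\big[e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}\big] = e^{\Phi_{\pi^*}}$$

holds.

When $\gamma_K < \Phi_{\pi^*}$, we know that

$$E_{g_k}\big[e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}\big] \ge E_{g_{k-1}}\big[e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}\big], \quad \forall k \ge K.$$

That is, the sequence $\{E_{g_k}[e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}]\}$ is monotone and hence converges. We then show by contradiction that the limit of this sequence must be $e^{\Phi_{\pi^*}}$.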
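Suppose that

$$\lim_{k \to \infty} E_{g_k}\big[e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}\big] = e^{\Phi_*} < e^{\Phi_{\pi^*}}.$$

Define the set

$$\Theta = \Big\{\pi : \Phi_\pi \ge \max\Big\{\gamma_K,\ \ln \frac{e^{\Phi_*} + e^{\Phi_{\pi^*}}}{2}\Big\}\Big\}.$$

Since $\gamma_K < \Phi_{\pi^*}$, the set $\Theta$ is not empty by the continuity property over the policy space of the MDP. Note that

$$g_k(\pi) = \prod_{i=1}^{k-1} \frac{e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_{i+1}\}}}{E_{g_i}\big[e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_{i+1}\}}\big]}\, g_1(\pi),$$

and

$$\lim_{k \to \infty} \frac{e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}}{E_{g_k}\big[e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}\big]} > 1, \quad \forall \pi \in \Theta,$$

we thus have

$$\lim_{k \to \infty} g_k(\pi) = \infty, \quad \forall \pi \in \Theta.$$

By Fatou's lemma, we have

$$1 = \liminf_{k \to \infty} \int g_k(\pi)\, d\pi \ge \liminf_{k \to \infty} \int_{\pi \in \Theta} g_k(\pi)\, d\pi \ge \int_{\pi \in \Theta} \liminf_{k \to \infty} g_k(\pi)\, d\pi = \infty,$$

which forms a contradiction. Hence, we have

$$\lim_{k \to \infty} E_{g_k}\big[e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}\big] = e^{\Phi_{\pi^*}}.$$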
Since $e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma\}}$ is a monotone function of $\Phi_\pi$ and a one-to-one map over the field $\{\pi : \Phi_\pi \ge \gamma\}$, the result above implies that

$$\lim_{k \to \infty} E_{g_k}[\pi] = \pi^*, \tag{37}$$

$$\lim_{k \to \infty} Var_{g_k}[\pi] = 0. \tag{38}$$

□

To complete the proof of the theorem, we next show that 

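$$E_{g_k}[\pi(R)] = E_{f(\cdot,\mu_k,\sigma_k)}[\pi(R)], \quad \forall R \in \mathcal{R},$$

$$E_{g_k}[\pi^2(R)] = E_{f(\cdot,\mu_k,\sigma_k)}[\pi^2(R)], \quad \forall R \in \mathcal{R}.$$
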
For the sake of simplicity, we first define a function 



/.(mepe^^/(7r(i?),0,a^,)^d7r(i?) 



Ma^(-R) 



Limev e^^f{7T{R), 0, aR)dTT{R) 

TT 



0. 



Since 



(7r(R)-MR)^ 



iref2 



min{M,7V} 

R=0 
min{A/,7V} 

n ^=^^ ^ 

min{A/,JV} 

II e "R '•'"R — ^^^^= 

min{M,7V| , 

W e ^"H/(7r(i?),0,aij) 

R=0 
min|M,7V| 

J] [e /(^(i?),0,a,i) 

ii=0 



It follows that 

/.(fl)Gpe'^/W^),0,afl)7r(i?)d^(i?) 



Vi? e 7^. 



LiR)ev'' f{niR),0,aR)dTT{R) 



4 



1 



e ^ 



By multiplying the same constant on the numerator and 
denominator of the terms on both sides, we have 



L{R)GV /(^(■^)' /^fl' crR)TT{R)dTi{R) 

L(R)<^v^^'^'^R)^^^R^'^R)'^'^^R) 



VR e 7e, 



Since 



^J■R'^iR) 

7r{R)eV '^R 



f{TT{R),0,aR)dTT{R)], 



we then obtain 

i?(/X,(T,7fc 

inin{M,Af} 

= E 

R=0 
min{M,Af} 

+ E 

mm{M,Ar} 



f{TT{R),fj,R,aR)dTT{R) 

n{R)eV 

= 1, 



cZtt 



TT&I ^R 



we obtain 



e('^-i'*'/{$,>7,} In /(7r(i?), 0, aR)dT: 



7r(i?)d7r 



f{T:{R),liR,cjR)T:{R)dn{R),yR e 7e, 

7r(_R)G-p 



I.e. 



.ln[ 



fiRTT{R) 



f{TT{R),0,aR)dTT{R)]dTT}. 



Itt(R)<^V CTfi 

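Since the optimization problem in the parameter update is to solve

$$\max_{\mu,\sigma} F(\mu, \sigma, \gamma_k),$$

the updated parameters $(\mu_k, \sigma_k)$ thus maximize $F(\mu, \sigma, \gamma_k)$. It means that

$$\nabla F(\mu_k, \sigma_k, \gamma_k) = 0.$$

That is,

$$E_{g_k}[\pi(R)] = E_{f(\cdot,\mu_k,\sigma_k)}[\pi(R)], \quad \forall R \in \mathcal{R}.$$

Similarly, we can show that

$$E_{g_k}[\pi^2(R)] = E_{f(\cdot,\mu_k,\sigma_k)}[\pi^2(R)], \quad \forall R \in \mathcal{R}.$$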
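Combining this with Lemma 6, it follows that

$$\lim_{k \to \infty} E_{f(\cdot,\mu_k,\sigma_k)}[\pi] = \lim_{k \to \infty} E_{g_k}[\pi] = \pi^*,$$

and,

$$\begin{aligned}
\lim_{k \to \infty} Var_{f(\cdot,\mu_k,\sigma_k)}[\pi(R)]
&= \lim_{k \to \infty} \big\{E_{g_k}[\pi^2(R)] - E_{g_k}[\pi(R)]^2\big\} \\
&= \lim_{k \to \infty} Var_{g_k}[\pi(R)] \\
&= 0.
\end{aligned}$$

□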
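The moment-matching property used above can be illustrated on a finite sample. The Python sketch below makes several simplifying assumptions that are not part of the proof: the policy value $\pi(R)$ for a single state is a scalar, the parametrized density is an untruncated Gaussian, and the reference distribution is replaced by a handful of hypothetical sample points `xs` with nonnegative weights `ws` (standing in for $e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi \ge \gamma_k\}}$). Under these assumptions, the parameters that maximize the weighted log-likelihood are the weighted mean and the weighted standard deviation, so the first two moments coincide.

```python
import math


def weighted_gaussian_fit(xs, ws):
    """Return (mu, sigma) maximizing the weighted Gaussian log-likelihood."""
    total = sum(ws)
    mu = sum(w * x for x, w in zip(xs, ws)) / total
    var = sum(w * (x - mu) ** 2 for x, w in zip(xs, ws)) / total
    return mu, math.sqrt(var)


def weighted_log_likelihood(xs, ws, mu, sigma):
    """Sum over samples of w * ln f(x; mu, sigma) for a Gaussian density f."""
    return sum(
        w * (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
             - (x - mu) ** 2 / (2.0 * sigma ** 2))
        for x, w in zip(xs, ws)
    )


# Hypothetical sampled policy values for one state R and their weights.
xs = [0.1, 0.4, 0.5, 0.9]
ws = [1.0, 3.0, 4.0, 2.0]

mu_k, sigma_k = weighted_gaussian_fit(xs, ws)
best = weighted_log_likelihood(xs, ws, mu_k, sigma_k)

# Perturbing either parameter should not improve the objective.
for d_mu, d_sigma in [(0.05, 0.0), (-0.05, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert weighted_log_likelihood(xs, ws, mu_k + d_mu, sigma_k + d_sigma) <= best

print(mu_k, sigma_k)
```

The fitted mean equals the weighted sample mean and the fitted variance equals the weighted sample variance, which is the finite-sample analogue of the two identities shown above.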